Mailman 3 Add a .whitespace property to module unicodedata - Python-ideas - python.org

Add a .whitespace property to module unicodedata

Antonio Carlos Jorge Patricio

June 1, 2023

9:06 a.m.

I suggest including a simple str variable in the unicodedata module to mirror string.whitespace. It would contain all characters recognized by the CPython function [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314), so that:

    # existing
    string.whitespace = ' \t\n\r\x0b\x0c'

    # proposed
    unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
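For reference, a string like the proposed one can be derived at runtime instead of being typed by hand. This is only a minimal sketch: it assumes that str.isspace() agrees with _PyUnicode_IsWhitespace() on the running interpreter, and the name `unicode_whitespace` is illustrative, not an existing stdlib attribute.

```python
import string
import sys

# Collect every code point that str.isspace() treats as whitespace.
# `unicode_whitespace` is a hypothetical name, not part of the stdlib.
unicode_whitespace = "".join(
    chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
)

# The ASCII set from the string module should be a subset of the result.
ascii_is_subset = set(string.whitespace) <= set(unicode_whitespace)

print(ascii(unicode_whitespace))
print("ASCII whitespace included:", ascii_is_subset)
```

The exact contents depend on the Unicode version bundled with the interpreter, which is the maintenance concern raised later in the thread.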


Paul Moore

June 2023

11:18 a.m.

On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio < antoniocjp@gmail.com> wrote:

What's the use case? I can't think of a single occasion when I would have found this useful.

Paul


David Mertz, Ph.D.

11:26 a.m.

It feels to me like "split on whitespace" or "remove whitespace" are quite common operations. I've been frustrated a number of times by settling for the ASCII whitespace class when I really wanted the Unicode whitespace class.

-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.


Chris Angelico

12:07 p.m.

On Fri, 2 Jun 2023 at 02:27, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

They are indeed, quite common. It's a good thing Python makes those easy.

ChrisA
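To illustrate the point for readers following along: on str objects, split() and strip() called without arguments already treat non-ASCII whitespace as separators. The sample text below is made up for the demonstration.

```python
# Sample text containing EM SPACE, NO-BREAK SPACE, IDEOGRAPHIC SPACE
# and LINE SEPARATOR; the words themselves are arbitrary.
text = "\u2003alpha\u00a0beta\u3000gamma\u2028"

words = text.split()            # split on any run of Unicode whitespace
trimmed = text.strip()          # remove leading/trailing Unicode whitespace
squeezed = "".join(text.split())  # remove all whitespace

print(words)
print(ascii(trimmed))
print(squeezed)
```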


David Mertz, Ph.D.

12:14 p.m.

OK, fair enough. What about "has whitespace (including Unicode beyond ASCII)"?


Paul Moore

2:38 p.m.

On Thu, 1 Jun 2023 at 18:16, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

> OK, fair enough. What about "has whitespace (including Unicode beyond ASCII)"?

❯ py -m timeit -s "import re; r = re.compile(r'\s', re.U)" "r.search('ab\u2002cd')"
1000000 loops, best of 5: 262 nsec per loop

Paul
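The same check can be wrapped in a small helper. This is a sketch, and `has_whitespace` is a hypothetical name. For str patterns in Python 3, `\s` matches Unicode whitespace by default, so the re.U flag is redundant and is omitted here.

```python
import re

# Compiled once; for str patterns \s already covers Unicode whitespace.
_WHITESPACE = re.compile(r"\s")


def has_whitespace(s: str) -> bool:
    """Return True if s contains any whitespace character, ASCII or not."""
    return _WHITESPACE.search(s) is not None


print(has_whitespace("ab\u2002cd"))
print(has_whitespace("abcd"))
```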


Marc-Andre Lemburg

12:28 p.m.

On 01.06.2023 18:18, Paul Moore wrote:

Same here.

For those few cases where it might be useful, you can easily put the string into your application code.

Putting this into the stdlib would just mean that we'd have to recheck whether new Unicode whitespace chars were added every time the standard is upgraded. With ASCII, this won't happen in the foreseeable future ;-)

-- Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Jun 01 2023)

...
...
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/

::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/


David Mertz, Ph.D.

1:06 p.m.

I guess this is pretty general for the described need:

    %time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]
    CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
    Wall time: 18.7 ms

    unicode_whitespace
    [' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a', '\u202f', '\u205f', '\u3000']

It's milliseconds not nanoseconds, but presumably something you do once at the start of an application. Can anyone think of a more efficient and/or more concise way of doing this?

This definitely feels better than making a static sequence of characters, since the Unicode Consortium may change (and has changed) the definition. In particular, MONGOLIAN VOWEL SEPARATOR (U+180E) was removed from the whitespace category to which it previously belonged.

I'm not sure why U+FEFF isn't included, but that seems to match the current standards, so all good.


Richard Damon

2:14 p.m.

On 6/1/23 2:06 PM, David Mertz, Ph.D. wrote:

> I'm not sure why U+FEFF isn't included, but that seems to match the current standards, so all good.

I think it's because ZERO WIDTH NO-BREAK SPACE (aka the BOM) doesn't act like a "space" character. If used as the BOM, it is intended to be stripped out when read, with the UTF-16/UTF-32 data that follows it simply read and its byte order corrected as the mark indicates. If used elsewhere as ZWNBSP (a use that has been deprecated in favor of U+2060), then its use is intentionally "no-break", so it is not a space to separate on.

-- Richard Damon
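A quick way to see how the running interpreter classifies these two characters is sketched below. It only inspects the bundled Unicode database, so the variable names are illustrative and the results reflect whatever Unicode version that Python ships with.

```python
import unicodedata

bom = "\ufeff"          # ZERO WIDTH NO-BREAK SPACE / byte order mark
word_joiner = "\u2060"  # the recommended replacement for the no-break use

# Both are format characters (category Cf), not space separators (Zs).
bom_info = (unicodedata.name(bom), unicodedata.category(bom), bom.isspace())
wj_info = (
    unicodedata.name(word_joiner),
    unicodedata.category(word_joiner),
    word_joiner.isspace(),
)

print(bom_info)
print(wj_info)

# Because U+FEFF is not whitespace, split() does not break on it.
pieces = "a\ufeffb".split()
print(len(pieces))
```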


Ethan Furman

5:47 p.m.

On 6/1/23 11:06, David Mertz, Ph.D. wrote:

> I guess this is pretty general for the described need:
>
> unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]

Using the module-level `__getattr__`, that could be a lazy attribute.

-- ~Ethan~


Barry

3:17 p.m.

On 1 Jun 2023, at 19:10, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

> %time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]

Try 0x10ffff to get all of Unicode.

Barry
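A full-range version of the scan could look like the following sketch. It uses sys.maxunicode (0x10FFFF) plus one as the range bound so the last code point is included. Note that category "Zs" covers only space separators; characters from the original proposal such as "\t", "\n", U+2028 and U+2029 belong to other categories and are not picked up by this filter. The variable names are illustrative.

```python
import sys
import unicodedata

# Every code point whose general category is Zs (space separator).
space_separators = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
]

# Whitespace from the original proposal that a Zs-only filter misses.
not_zs = [ch for ch in "\t\n\u2028\u2029" if ch not in space_separators]

print([ascii(ch) for ch in space_separators])
print([ascii(ch) for ch in not_zs])
```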


Marc-Andre Lemburg

3:32 p.m.

On 01.06.2023 20:06, David Mertz, Ph.D. wrote:

> I guess this is pretty general for the described need:
>
> %time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]

Use sys.maxunicode instead of 0xFFFF

> CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
> Wall time: 18.7 ms
>
> unicode_whitespace
> [' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a', '\u202f', '\u205f', '\u3000']
>
> It's milliseconds not nanoseconds, but presumably something you do once at the start of an application. Can anyone think of a more efficient and/or more concise way of doing this?

There isn't. You essentially have to scan the entire database for whitespacy chars.

> This definitely feels better than making a static sequence of characters since the Unicode Consortium may (and has) changed the definition.

Which was my point: including the above in a stdlib module wouldn't make sense, since it increases module load time (and possibly startup time), so it's better to generate a string and put this verbatim into the application.

However, this would have to be part of the Unicode database update dance, and whitespace is only one possible category of chars which would be interesting. Digits or numbers are another; letters, linebreaks, symbols, etc. are others:

https://www.unicode.org/reports/tr44/#GC_Values_Table

It's better to put this into the application in question or to have someone maintain such collections outside the stdlib in a package on PyPI.

-- Marc-Andre Lemburg
eGenix.com


David Mertz, Ph.D.

3:52 p.m.

If we're talking PyPI, it would be nice to have:

    unicode_categories = {"Zs": [...], "Ll": [...], ...}

for all the various categories. It would just take one pass through all the characters to generate it, but then every category would be fast to access later.

On the other hand, it's a few lines of code with a lazy import. Probably not enough code to put on PyPI.
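The one-pass construction described here might be sketched as follows. The function name `build_unicode_categories` is hypothetical; the resulting mapping includes every general category the interpreter's Unicode database reports, including large ones such as unassigned code points ("Cn").

```python
import sys
import unicodedata
from collections import defaultdict


def build_unicode_categories() -> dict[str, list[str]]:
    """Group every code point by its general category in a single pass."""
    categories: dict[str, list[str]] = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        categories[unicodedata.category(ch)].append(ch)
    return dict(categories)


unicode_categories = build_unicode_categories()

# Print the size of a few categories mentioned in the thread.
for key in ("Zs", "Sc", "Ll", "Nd"):
    print(key, len(unicode_categories[key]))
```

Building the mapping touches more than a million code points, so deferring it until first use (for example behind a module-level `__getattr__` or functools.cache) keeps import time unaffected.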


Chris Angelico

3:56 p.m.

On Sat, 3 Jun 2023 at 06:54, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

Question: What is the advantage of having this? What are the use-cases? ChrisA


David Mertz, Ph.D.

4:08 p.m.

    def does_string_have_currency_mark(s):
        return bool(set(s) & set(unicode_categories['Sc']))

    def does_string_have_numeric_digit(s):
        ...

... and so on. Those seem like questions one asks often enough. Not every day, but more than never.


Chris Angelico

4:17 p.m.

On Sat, 3 Jun 2023 at 07:08, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

These questions are much better answered with the unicodedata.category() function. First figure out what categories your string has:

    cats = set(unicodedata.category(ch) for ch in s)

And then check whether Sc is in that set, or whatever others you care about. This way, the set contains only the categories, not the characters; there's no reason to do set intersection with all of the characters.

ChrisA
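Spelled out as a runnable sketch, with a made-up sample string and a hypothetical helper name `categories_in`, the approach looks like this:

```python
import unicodedata


def categories_in(s: str) -> set[str]:
    """Return the set of general categories present in s."""
    return {unicodedata.category(ch) for ch in s}


cats = categories_in("Price: 5 \u20ac")  # the last character is EURO SIGN

print(sorted(cats))
print("has currency symbol:", "Sc" in cats)
print("has decimal digit:", "Nd" in cats)
```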


David Mertz, Ph.D.

4:28 p.m.

Sure. That's fine. With sufficiently long strings my code is faster, but for "typical" strings yours will be.


Chris Angelico

4:34 p.m.

On Sat, 3 Jun 2023 at 07:28, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

> Sure. That's fine. With sufficiently long strings my code is faster, but for "typical" strings yours will be.

Really? How? Your code has to build a set of every character in the string; mine builds a set of every category in the string. Set intersection won't be slower for a smaller set.

ChrisA


David Mertz, Ph.D.

5:28 p.m.

This is just bar talk at this point. I think we've shown that this is easy enough to do that programmers can roll their own.

But as idle chat goes, note that in your code:

    set(unicodedata.category(ch) for ch in s)

If `s` is a billion characters long, then we make a billion calls to the `.category()` method. Python calls are comparatively expensive, even on well optimized data structures like strings.

In my version:

    bool(set(s) & set(unicode_categories['Sc']))

The billion characters are first reduced to a smallish set of hundreds or thousands of distinct characters without needing method calls. Then that is intersected with a smallish set of characters in the category.

You could optimize your version, however, simply by using:

    set(unicodedata.category(set(ch)) for ch in s)

Yours provides more information, since it lists all the categories. But if you REALLY only care about one category, then you still have to ask `'Sc' in set(unicodedata.category(set(ch)) for ch in s)`. Which is fine, that's not a hard question to ask.


Chris Angelico

6:26 p.m.

On Sat, 3 Jun 2023 at 08:28, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

Or perhaps:

    set(unicodedata.category(ch) for ch in set(s))

But measure before considering this worthwhile.

If you REALLY want to just check whether any category is there, you probably want something like:

    any(unicodedata.category(ch) == "Sc" for ch in s)

which is completely different from what you were suggesting, and still doesn't require the string of all codepoints in the category.

Point is, querying the string is almost always going to be more efficient than intersecting with the full gamut of that category.

ChrisA
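Both variants discussed here can be written as small helpers. This is a sketch with hypothetical function names: the first short-circuits on the first match, and the second deduplicates the characters before querying the database. Which one is faster depends on the input, so it is worth measuring.

```python
import unicodedata


def has_category(s: str, category: str) -> bool:
    """Short-circuit as soon as one character of the category is found."""
    return any(unicodedata.category(ch) == category for ch in s)


def has_category_distinct(s: str, category: str) -> bool:
    """Deduplicate first, so each distinct character is looked up once."""
    return any(unicodedata.category(ch) == category for ch in set(s))


sample = "total: 12 \u00a3"  # ends with POUND SIGN
print(has_category(sample, "Sc"), has_category_distinct(sample, "Sc"))
print(has_category("no symbols", "Sc"), has_category_distinct("no symbols", "Sc"))
```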


David Mertz, Ph.D.

6:42 p.m.

Yeah... oops. Obviously I typed the version in email. Should have done it in the shell. But you got the intention of set-ifying the characters in the large string.

Yes on lies, damn lies, and benchmarks.


Chris Angelico

6:45 p.m.

On Sat, 3 Jun 2023 at 09:42, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

> Yeah... oops. Obviously I typed the version in email. Should have done it in the shell. But you got the intention of set-ifying the characters in the large string.

Yep. I thought of that as I was originally writing, but absent benchmarking data, I prefer the simplest way of writing something.

ChrisA


David Mertz, Ph.D.

7:12 p.m.

Let's call the styles a tie. Using the SOWPODS scrabble wordlist (no currency symbols, so False answer):

Of course, this is a small character set of 26 lowercase letters (and newline as I did it). A more diverse alphabet might tip the timing slightly, but it's going to be a small matter either way.


Chris Angelico

7:20 p.m.

On Sat, 3 Jun 2023 at 10:12, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

Remember though, the original request was not for a set, but for a string. Try your timing again when working with a string.

The any() form is almost certainly the most effective, although I suppose it could be implemented in C for better performance (avoiding calling back into Python repeatedly). Not sure it's necessary though.

ChrisA
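For anyone who wants to repeat the comparison, a timing harness might look like the sketch below. The input is a synthetic stand-in (26 lowercase letters plus newline, repeated), not the SOWPODS list, and all function names are hypothetical. No timings are quoted here because they vary by machine and Python version.

```python
import sys
import timeit
import unicodedata

# Synthetic stand-in for a word list: lowercase letters and newlines only.
text = "abcdefghijklmnopqrstuvwxyz\n" * 10_000

# The currency-symbol category as a string (the form originally proposed)
# and as a set (the form used in the set-intersection style).
sc_string = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Sc"
)
sc_set = set(sc_string)


def by_set_intersection(s: str) -> bool:
    return bool(set(s) & sc_set)


def by_string_membership(s: str) -> bool:
    # Works from the string form: substring search per distinct character.
    return any(ch in sc_string for ch in set(s))


def by_any(s: str) -> bool:
    return any(unicodedata.category(ch) == "Sc" for ch in s)


REPEATS = 5
for fn in (by_set_intersection, by_string_membership, by_any):
    seconds = timeit.timeit(lambda: fn(text), number=REPEATS)
    print(f"{fn.__name__}: {seconds / REPEATS * 1e3:.2f} ms per call")
```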

Reply

Sign in to reply online Use email software

Paul Moore

June 2023

11:18 a.m.

On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio < antoniocjp@gmail.com> wrote:

What's the use case? I can't think of a single occasion when I would have found this useful. Paul

Reply

Sign in to reply online Use email software

David Mertz, Ph.D.

11:26 a.m.

It feels to me like "split on whitespace" or "remove whitespace" are quite common operations. I've been frustrated a number of times by settling for the ASCII whitespace class when I really wanted the Unicode whitespace class. On Thu, Jun 1, 2023 at 12:20 PM Paul Moore <p.f.moore@gmail.com> wrote:

-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.

Reply

Sign in to reply online Use email software

Chris Angelico

12:07 p.m.

On Fri, 2 Jun 2023 at 02:27, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

They are indeed, quite common. It's a good thing Python makes those easy.

ChrisA

Reply

Sign in to reply online Use email software

David Mertz, Ph.D.

12:14 p.m.

OK, fair enough. What about "has whitespace (including Unicode beyond ASCII)"? On Thu, Jun 1, 2023 at 1:08 PM Chris Angelico <rosuav@gmail.com> wrote:

-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.

Reply

Sign in to reply online Use email software

Paul Moore

2:38 p.m.

On Thu, 1 Jun 2023 at 18:16, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

OK, fair enough. What about "has whitespace (including Unicode beyond ASCII)"?

❯ py -m timeit -s "import re; r = re.compile(r'\s', re.U)" "r.search('ab\u2002cd')" 1000000 loops, best of 5: 262 nsec per loop Paul

Reply

Sign in to reply online Use email software

Marc-Andre Lemburg

12:28 p.m.

On 01.06.2023 18:18, Paul Moore wrote:

Same here. For those few cases, where it might be useful, you can easily put the string into your application code. Putting this into the stdlib would just mean that we'd have to recheck whether new Unicode whitespace chars were added, every time the standard upgrades. With ASCII, this won't happen in the foreseeable future ;-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jun 01 2023)

...
...
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/

::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/

Reply

Sign in to reply online Use email software

David Mertz, Ph.D.

June 2023

1:06 p.m.

I guess this is pretty general for the described need:

It's milliseconds not nanoseconds, but presumably something you do once at the start of an application. Can anyone think of a more efficient and/or more concise way of doing this? This definitely feels better than making a static sequence of characters since the Unicode Consortium may (and has) changed the definition. In particular, MONGOLIAN VOWEL SEPARATOR (U+180E) was removed from the whitespace category to which it previously belonged. I'm not sure why U+FEFF isn't included, but that seems to match the current standards, so all good. On Thu, Jun 1, 2023 at 1:29 PM Marc-Andre Lemburg <mal@egenix.com> wrote:

On 01.06.2023 18:18, Paul Moore wrote:

...
On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio <antoniocjp@gmail.com <mailto:antoniocjp@gmail.com>> wrote:

I suggest including a simple str variable in unicodedata module to mirror string.whitespace, so it would contain all characters defined in CPython function [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314 <https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314>) so that:

# existent string.whitespace = ' \t\n\r\x0b\x0c'

# proposed unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'

What's the use case? I can't think of a single occasion when I would have found this useful.

Same here.

For those few cases, where it might be useful, you can easily put the string into your application code.

Putting this into the stdlib would just mean that we'd have to recheck whether new Unicode whitespace chars were added, every time the standard upgrades. With ASCII, this won't happen in the foreseeable future ;-)

-- Marc-Andre Lemburg eGenix.com

Professional Python Services directly from the Experts (#1, Jun 01 2023)

...
...
...
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/

_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/REMDZ2... Code of Conduct: http://python.org/psf/codeofconduct/

-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.

Reply

Sign in to reply online Use email software

Richard Damon

2:14 p.m.

On 6/1/23 2:06 PM, David Mertz, Ph.D. wrote:

I'm not sure why U+FEFF isn't included, but that seems to match the current standards, so all good.

I think because Zero Width, No-Breaking Space, (aka BOM Mark) doesn't act like a "Space" character. If used as the BOM mark, it is intended that it gets stripped out when read and the UTF-16/UTF-32 data file that follows it be typically just read and have its byte order corrected as the mark indicates. If used elsewhere as the ZWNBSP (which has been deprecated and replaced with U+2060) then it use is intentionally "no-break" so not a space to seperate on. -- Richard Damon

Reply

Sign in to reply online Use email software

Ethan Furman

5:47 p.m.

On 6/1/23 11:06, David Mertz, Ph.D. wrote:

I guess this is pretty general for the described need:

...
...
...
unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]

Using the module-level `__getattr__` that could be a lazy attribute. -- ~Ethan~

Reply

Sign in to reply online Use email software

Barry

3:17 p.m.

On 1 Jun 2023, at 19:10, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

%time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]

Try 0x10ffff to get all of unicode. Barry

Reply

Sign in to reply online Use email software

Marc-Andre Lemburg

3:32 p.m.

On 01.06.2023 20:06, David Mertz, Ph.D. wrote:

I guess this is pretty general for the described need:

...
...
...
%time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]

Use sys.maxunicode instead of 0xFFFF

CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms Wall time: 18.7 ms

...
...
...
unicode_whitespace [' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a', '\u202f', '\u205f', '\u3000']

It's milliseconds not nanoseconds, but presumably something you do once at the start of an application. Can anyone think of a more efficient and/or more concise way of doing this?

There isn't. You essentially have to scan the entire database for whitespacy chars.

This definitely feels better than making a static sequence of characters since the Unicode Consortium may (and has) changed the definition.

Which was my point: including the above in a stdlib module wouldn't make sense, since it increases module load time (and possibly startup time), so it's better to generate a string and put this verbatim into the application. However, this would have to be part of the Unicode database update dance and whitespace is only possible category of chars which would be interesting. Digits or numbers are another, letter, linebreaks, symbols, etc. others: https://www.unicode.org/reports/tr44/#GC_Values_Table It's better to put this into the application in question or to have someone maintain such collections outside the stdlib in a package on PyPI. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Jun 02 2023)

...
...
Python Projects, Coaching and Support ... https://www.egenix.com/ Python Product Development ... https://consulting.egenix.com/

::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 https://www.egenix.com/company/contact/ https://www.malemburg.com/

Reply

Sign in to reply online Use email software

David Mertz, Ph.D.

3:52 p.m.

If we're talking PyPI, it would be nice to have: unicode_categories = {"Zs": [...], "Ll": [...], ...} For all the various categories. It would just take one pass through all the characters to generate it, but then every category would be fast to access later. On the other hand, it's a few lines of code with a lazy import. Probably not enough code to put on PyPI. On Fri, Jun 2, 2023 at 4:32 PM Marc-Andre Lemburg <mal@egenix.com> wrote:

On 01.06.2023 20:06, David Mertz, Ph.D. wrote:

I guess this is pretty general for the described need:

%time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]

Use sys.maxunicode instead of 0xFFFF

CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
Wall time: 18.7 ms

unicode_whitespace
[' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a', '\u202f', '\u205f', '\u3000']

It's milliseconds not nanoseconds, but presumably something you do once at the start of an application. Can anyone think of a more efficient and/or more concise way of doing this?

There isn't. You essentially have to scan the entire database for whitespacy chars.
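The full scan described here can be sketched as follows. This is a minimal illustration, not code from the thread: it assumes `str.isspace()` is the test that matches the originally proposed string, and the names `unicode_whitespace` and `space_separators` are only illustrative. It follows the suggestion to use `sys.maxunicode` as the upper bound.

```python
import sys
import unicodedata

# Scan every code point once. str.isspace() accepts category Zs plus the
# bidirectional classes WS, B and S, so it also picks up tab, newline, etc.
unicode_whitespace = "".join(
    chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
)

# The narrower "Zs" (space separator) category used in the quoted snippet.
space_separators = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
)
```

The exact contents depend on the Unicode version bundled with the interpreter, which is the reason given above for not hard-coding the string.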

This definitely feels better than making a static sequence of characters since the Unicode Consortium may change (and has changed) the definition.

Which was my point: including the above in a stdlib module wouldn't make sense, since it increases module load time (and possibly startup time), so it's better to generate a string and put this verbatim into the application.

However, this would have to be part of the Unicode database update dance and whitespace is only one possible category of chars which would be interesting. Digits or numbers are another; letters, linebreaks, symbols, etc. are others:

https://www.unicode.org/reports/tr44/#GC_Values_Table

It's better to put this into the application in question or to have someone maintain such collections outside the stdlib in a package on PyPI.
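A one-pass table of the kind suggested in this message (`unicode_categories`) might look like the sketch below. The helper name `build_unicode_categories` is hypothetical, and the choice to store lists of characters simply mirrors the `{"Zs": [...], "Ll": [...]}` shape mentioned above.

```python
import sys
import unicodedata
from collections import defaultdict


def build_unicode_categories():
    """Map each general category (e.g. "Zs", "Sc") to a list of its characters."""
    categories = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        categories[unicodedata.category(ch)].append(ch)
    return dict(categories)


# Built once; afterwards every category is a plain dictionary lookup.
unicode_categories = build_unicode_categories()
```

Note that this keeps every code point in memory, including the large unassigned ("Cn") range, so an application that needs only a few categories may prefer to filter while scanning.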

Chris Angelico

June 2023

3:56 p.m.

On Sat, 3 Jun 2023 at 06:54, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

Question: What is the advantage of having this? What are the use-cases? ChrisA

David Mertz, Ph.D.

4:08 p.m.

def does_string_have_currency_mark(s):
    return bool(set(s) & set(unicode_categories['Sc']))

def does_string_have_numeric_digit(s):
    ...

... and so on. Those seem like questions one asks often enough. Not every day, but more than never.

On Fri, Jun 2, 2023 at 4:59 PM Chris Angelico <rosuav@gmail.com> wrote:

Chris Angelico

4:17 p.m.

On Sat, 3 Jun 2023 at 07:08, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

These questions are much better answered with the unicodedata.category() function. First figure out what categories your string has:

cats = set(unicodedata.category(ch) for ch in s)

And then check whether Sc is in that set, or whatever others you care about. This way, the set contains only the categories, not the characters; there's no reason to do set intersection with all of the characters.

ChrisA
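As a small worked example of this approach (the helper name `categories_in` and the sample string are illustrative, not from the thread):

```python
import unicodedata


def categories_in(s):
    """Return the set of Unicode general categories present in `s`."""
    return {unicodedata.category(ch) for ch in s}


cats = categories_in("Price: 10 €")
has_currency = "Sc" in cats  # currency symbol
has_digit = "Nd" in cats     # decimal digit
```

One pass over the string answers any number of "does it contain category X?" questions without needing a precomputed list of characters.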

David Mertz, Ph.D.

4:28 p.m.

Sure. That's fine. With sufficiently long strings my code is faster, but for "typical" strings yours will be.

On Fri, Jun 2, 2023, 5:20 PM Chris Angelico <rosuav@gmail.com> wrote:

Chris Angelico

4:34 p.m.

On Sat, 3 Jun 2023 at 07:28, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

Sure. That's fine. With sufficiently long strings my code is faster, but for "typical" strings yours will be.

Really? How? Your code has to build a set of every character in the string; mine builds a set of every category in the string. Set intersection won't be slower for a smaller set. ChrisA

David Mertz, Ph.D.

5:28 p.m.

This is just bar talk at this point. I think we've shown that this is easy enough to do that programmers can roll their own.

But as idle chat goes, note that in your code:

set(unicodedata.category(ch) for ch in s)

If `s` is a billion characters long, then we make a billion calls to the `.category()` method. Python calls are comparatively expensive, even on well optimized data structures like strings.

In my version:

bool(set(s) & set(unicode_categories['Sc']))

The billion characters are first reduced to a smallish set of hundreds or thousands of distinct characters without needing method calls. Then that is intersected with a smallish set of characters in the category.

You could optimize your version, however, simply by using:

set(unicodedata.category(set(ch)) for ch in s)

Yours provides more information, since it lists all the categories. But if you REALLY only care about one category, then you still have to ask `'Sc' in set(unicodedata.category(set(ch)) for ch in s)`. Which is fine, that's not a hard question to ask.

On Fri, Jun 2, 2023 at 5:36 PM Chris Angelico <rosuav@gmail.com> wrote:

Chris Angelico

June 2023

6:26 p.m.

On Sat, 3 Jun 2023 at 08:28, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

Or perhaps:

set(unicodedata.category(ch) for ch in set(s))

But measure before considering this worthwhile.

If you REALLY want to just check whether any category is there, you probably want something like:

any(unicodedata.category(ch) == "Sc" for ch in s)

which is completely different from what you were suggesting, and still doesn't require the string of all codepoints in the category.

Point is, querying the string is almost always going to be more efficient than intersecting with the full gamut of that category.

ChrisA
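Wrapped as a reusable helper, the `any()` form might look like this sketch (the function name `has_category` is hypothetical):

```python
import unicodedata


def has_category(s, category):
    """Return True as soon as one character of `s` has the given category."""
    return any(unicodedata.category(ch) == category for ch in s)


print(has_category("cost: $5", "Sc"))
print(has_category("hello", "Sc"))
```

Because `any()` short-circuits, the scan stops at the first matching character rather than processing the whole string.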

David Mertz, Ph.D.

6:42 p.m.

Yeah... oops. Obviously I typed the version in email. Should have done it in the shell. But you got the intention of set-ifying the characters in the large string. Yes on lies, damn lies, and benchmarks. On Fri, Jun 2, 2023, 7:29 PM Chris Angelico <rosuav@gmail.com> wrote:

On Sat, 3 Jun 2023 at 08:28, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

This is just bar talk at this point. I think we've shown that this is easy enough to do that programmers can roll their own.

But as idle chat goes, note that in your code:

set(unicodedata.category(ch) for ch in s)

If `s` is a billion characters long, then we make a billion calls to the `.category()` method. Python calls are comparatively expensive, even on well optimized data structures like strings.

In my version:

bool(set(s) & set(unicode_categories['Sc']))

The billion characters are first reduced to a smallish set of hundreds or thousands of distinct characters without needing method calls. Then that is intersected with a smallish set of characters in the category.

You could optimize your version, however, simply by using:

set(unicodedata.category(set(ch)) for ch in s)

Or perhaps:

set(unicodedata.category(ch) for ch in set(s))

But measure before considering this worthwhile.

Yours provides more information, since it lists all the categories. But if you REALLY only care about one category, then you still have to ask `'Sc' in set(unicodedata.category(set(ch)) for ch in s)`. Which is fine, that's not a hard question to ask.

If you REALLY want to just check whether any category is there, you probably want something like:

any(unicodedata.category(ch) == "Sc" for ch in s)

which is completely different from what you were suggesting, and still doesn't require the string of all codepoints in the category.

Point is, querying the string is almost always going to be more efficient than intersecting with the full gamut of that category.

ChrisA

Chris Angelico

6:45 p.m.

On Sat, 3 Jun 2023 at 09:42, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

Yeah... oops. Obviously I typed the version in email. Should have done it in the shell. But you got the intention of set-ifying the characters in the large string.

Yep. I thought of that as I was originally writing, but absent benchmarking data, I prefer the simplest way of writing something. ChrisA

David Mertz, Ph.D.

7:12 p.m.

Let's call the styles a tie. Using the SOWPODS scrabble wordlist (no currency symbols, so False answer):

Of course, this is a small character set of 26 lowercase letters (and newline as I did it). A more diverse alphabet might tip the timing slightly, but it's going to be a small matter either way. On Fri, Jun 2, 2023 at 7:49 PM Chris Angelico <rosuav@gmail.com> wrote:

Chris Angelico

7:20 p.m.

On Sat, 3 Jun 2023 at 10:12, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

Remember though, the original request was not for a set, but for a string. Try your timing again when working with a string. The any() form is almost certainly the most effective, although I suppose it could be implemented in C for better performance (avoiding calling back into Python repeatedly). Not sure it's necessary though. ChrisA
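One way such a comparison could be set up is sketched below. The input is a synthetic stand-in, not the SOWPODS list, and all function names are illustrative. No timings are claimed here; results will vary with the interpreter and the input.

```python
import sys
import timeit
import unicodedata

# Synthetic stand-in for a word list (lowercase letters and newlines only).
text = "\n".join(["scrabble", "wordlist", "example"] * 10_000)

# One-off scan for currency symbols, kept both as a string and as a set.
currency_string = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Sc"
)
currency_set = set(currency_string)


def by_set_intersection(s):
    return bool(set(s) & currency_set)


def by_string_membership(s):
    # Works against the precomputed *string*, as in the original request.
    return any(ch in currency_string for ch in set(s))


def by_any(s):
    # Queries the database per character; needs no precomputed data.
    return any(unicodedata.category(ch) == "Sc" for ch in s)


candidates = (by_set_intersection, by_string_membership, by_any)
for fn in candidates:
    seconds = timeit.timeit(lambda: fn(text), number=10)
    print(f"{fn.__name__}: {seconds:.4f} s for 10 runs")
```

Only `by_any` avoids the up-front scan of the whole database; the other two pay that cost once and then reuse the result.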

23 comments

8 participants

participants (8)

Antonio Carlos Jorge Patricio
Barry
Chris Angelico
David Mertz, Ph.D.
Ethan Furman
Marc-Andre Lemburg
Paul Moore
Richard Damon