<div dir="ltr">Who would benefit from changing this? Let's not change things just because we can, or because Perl 6 does it.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Nov 16, 2017 at 9:21 AM, MRAB <span dir="ltr">&lt;<a href="mailto:python@mrabarnett.plus.com" target="_blank">python@mrabarnett.plus.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 2017-11-16 10:23, Serhiy Storchaka wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Currently the re module ignores only 6 ASCII whitespace characters in<br>

re.VERBOSE mode:<br>

<br>

       U+0009 CHARACTER TABULATION<br>

       U+000A LINE FEED<br>

       U+000B LINE TABULATION<br>

       U+000C FORM FEED<br>

       U+000D CARRIAGE RETURN<br>

       U+0020 SPACE<br>

<br>

Perl ignores the characters that Unicode calls "Pattern White Space" in<br>

/x mode. It ignores 5 additional non-ASCII characters:<br>

<br>

       U+0085 NEXT LINE<br>

       U+200E LEFT-TO-RIGHT MARK<br>

       U+200F RIGHT-TO-LEFT MARK<br>

       U+2028 LINE SEPARATOR<br>

       U+2029 PARAGRAPH SEPARATOR<br>

<br>
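A minimal sketch for checking the two lists above against the installed re module. The helper name ignored_in_verbose and the two constants are illustrative, not part of any library; the sketch assumes a character counts as "ignored" when a verbose pattern containing it between two literals still matches the bare literals.<br>

```python
import re

# The 6 ASCII characters listed above as ignored by re.VERBOSE.
ASCII_VERBOSE_WS = "\t\n\x0b\x0c\r "
# The 5 extra "Pattern White Space" characters listed for Perl's /x mode.
EXTRA_PATTERN_WS = "\x85\u200e\u200f\u2028\u2029"


def ignored_in_verbose(ch):
    """Return True if re.VERBOSE skips `ch` when it sits between two literals."""
    # If `ch` is ignored, the pattern collapses to "ab" and matches "ab".
    return re.fullmatch("a" + ch + "b", "ab", re.VERBOSE) is not None


for ch in ASCII_VERBOSE_WS + EXTRA_PATTERN_WS:
    print(f"U+{ord(ch):04X} ignored: {ignored_in_verbose(ch)}")
```

<br>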

The regex module simply ignores characters for which str.isspace() returns<br>

True. It ignores 20 additional whitespace characters,<br>

including the characters U+001C..U+001F, whose classification as whitespace is<br>

questionable, but doesn't ignore LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK.<br>

<br>

       U+001C [FILE SEPARATOR]<br>

       U+001D [GROUP SEPARATOR]<br>

       U+001E [RECORD SEPARATOR]<br>

       U+001F [UNIT SEPARATOR]<br>

       U+00A0 NO-BREAK SPACE<br>

       U+1680 OGHAM SPACE MARK<br>

       U+2000 EN QUAD<br>

       U+2001 EM QUAD<br>

       U+2002 EN SPACE<br>

       U+2003 EM SPACE<br>

       U+2004 THREE-PER-EM SPACE<br>

       U+2005 FOUR-PER-EM SPACE<br>

       U+2006 SIX-PER-EM SPACE<br>

       U+2007 FIGURE SPACE<br>

       U+2008 PUNCTUATION SPACE<br>

       U+2009 THIN SPACE<br>

       U+200A HAIR SPACE<br>

       U+202F NARROW NO-BREAK SPACE<br>

       U+205F MEDIUM MATHEMATICAL SPACE<br>

       U+3000 IDEOGRAPHIC SPACE<br>

<br>
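The comparison above can be reproduced with a short sketch that scans every code point. The set names and the code_points helper are illustrative; PATTERN_WS is hard-coded from the two lists earlier in this message, and the exact contents of the str.isspace() set depend on the Unicode version bundled with the interpreter.<br>

```python
import sys

# The 6 ASCII characters ignored by re.VERBOSE, per the first list.
ASCII_VERBOSE_WS = set("\t\n\x0b\x0c\r ")
# "Pattern White Space": those 6 plus the 5 characters listed for Perl.
PATTERN_WS = ASCII_VERBOSE_WS | set("\x85\u200e\u200f\u2028\u2029")

# Every code point for which str.isspace() is true in this Python build.
ISSPACE = {chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}


def code_points(chars):
    """Format a set of characters as a sorted list of U+XXXX labels."""
    return [f"U+{cp:04X}" for cp in sorted(ord(c) for c in chars)]


print("isspace() but not Pattern White Space:", code_points(ISSPACE - PATTERN_WS))
print("Pattern White Space but not isspace():", code_points(PATTERN_WS - ISSPACE))
```

<br>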

</blockquote></div></div>

str.isspace appears to be Unicode "Whitespace" plus those 4 "questionable" codepoints.<div class="HOEnZb"><div class="h5"><br>
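One way to see where those 4 code points come from, as a sketch: the Python documentation defines str.isspace() as true for characters whose general category is Zs or whose bidirectional class is WS, B, or S. The classify helper below is illustrative and simply reports those two properties from unicodedata.<br>

```python
import unicodedata


def classify(cp):
    """Return (general category, bidirectional class, isspace) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.bidirectional(ch), ch.isspace()


# U+001C..U+001F: the four "questionable" separators.
for cp in range(0x1C, 0x20):
    category, bidi, is_space = classify(cp)
    print(f"U+{cp:04X} category={category} bidi={bidi} isspace={is_space}")
```

<br>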

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Is it worth extending the set of ignored whitespace characters to "Pattern<br>

White Space"? Would it add any benefit? Or add confusion? Should this<br>

depend on the re.ASCII mode? Should the byte b'\x85' be ignorable in<br>

verbose bytes patterns?<br>

<br>

And there is a similar question about the Python parser. If Python uses<br>

the Unicode definition of identifiers, shouldn't it accept non-ASCII<br>

"Pattern White Space" characters as whitespace? There would be technical problems<br>

with supporting this, but are there any benefits?<br>

<br>

<br>

<a href="https://perldoc.perl.org/perlre.html" rel="noreferrer" target="_blank">https://perldoc.perl.org/perlr<wbr>e.html</a><br>

<a href="https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax" rel="noreferrer" target="_blank">https://www.unicode.org/report<wbr>s/tr31/tr31-4.html#Pattern_<wbr>Syntax</a><br>

<a href="https://unicode.org/L2/L2005/05012r-pattern.html" rel="noreferrer" target="_blank">https://unicode.org/L2/L2005/0<wbr>5012r-pattern.html</a><br>

<br>

</blockquote>


</div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">--Guido van Rossum (<a href="http://python.org/~guido" target="_blank">python.org/~guido</a>)</div>

</div>