<div dir="ltr">Who would benefit from changing this? Let's not change things just because we can, or because Perl 6 does it.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Nov 16, 2017 at 9:21 AM, MRAB <span dir="ltr"><<a href="mailto:python@mrabarnett.plus.com" target="_blank">python@mrabarnett.plus.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 2017-11-16 10:23, Serhiy Storchaka wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Currently the re module ignores only 6 ASCII whitespace characters in<br>
re.VERBOSE mode:<br>
<br>
U+0009 CHARACTER TABULATION<br>
U+000A LINE FEED<br>
U+000B LINE TABULATION<br>
U+000C FORM FEED<br>
U+000D CARRIAGE RETURN<br>
U+0020 SPACE<br>
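<br>
A minimal sketch of how this can be checked, assuming CPython's current re behaviour. The helper name ignored_in_verbose is made up for illustration and is not part of the re module:<br>
<pre>
```python
import re

# The six ASCII whitespace characters listed above.
ASCII_WS = "\t\n\x0b\x0c\r "


def ignored_in_verbose(ch):
    """Return True if `ch` is skipped between tokens in re.VERBOSE mode."""
    # If `ch` is ignored, the pattern reduces to "ab" and matches "ab".
    # Otherwise `ch` is a literal and the full match fails.
    return re.fullmatch("a" + ch + "b", "ab", re.VERBOSE) is not None


ascii_results = {ch: ignored_in_verbose(ch) for ch in ASCII_WS}
nbsp_ignored = ignored_in_verbose("\u00a0")  # NO-BREAK SPACE
nel_ignored = ignored_in_verbose("\u0085")   # NEXT LINE

for ch, ignored in ascii_results.items():
    print(f"U+{ord(ch):04X} ignored: {ignored}")
print("U+00A0 ignored:", nbsp_ignored)
print("U+0085 ignored:", nel_ignored)
```
</pre>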
<br>
In /x mode, Perl ignores the characters that Unicode calls "Pattern White Space".<br>
That adds 5 non-ASCII characters:<br>
<br>
U+0085 NEXT LINE<br>
U+200E LEFT-TO-RIGHT MARK<br>
U+200F RIGHT-TO-LEFT MARK<br>
U+2028 LINE SEPARATOR<br>
U+2029 PARAGRAPH SEPARATOR<br>
<br>
The regex module simply ignores every character for which str.isspace() returns<br>
True. That covers 20 further characters not listed above, including the ASCII<br>
control characters U+001C..U+001F, whose classification as whitespace is<br>
questionable, but it doesn't ignore LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.<br>
<br>
U+001C [FILE SEPARATOR]<br>
U+001D [GROUP SEPARATOR]<br>
U+001E [RECORD SEPARATOR]<br>
U+001F [UNIT SEPARATOR]<br>
U+00A0 NO-BREAK SPACE<br>
U+1680 OGHAM SPACE MARK<br>
U+2000 EN QUAD<br>
U+2001 EM QUAD<br>
U+2002 EN SPACE<br>
U+2003 EM SPACE<br>
U+2004 THREE-PER-EM SPACE<br>
U+2005 FOUR-PER-EM SPACE<br>
U+2006 SIX-PER-EM SPACE<br>
U+2007 FIGURE SPACE<br>
U+2008 PUNCTUATION SPACE<br>
U+2009 THIN SPACE<br>
U+200A HAIR SPACE<br>
U+202F NARROW NO-BREAK SPACE<br>
U+205F MEDIUM MATHEMATICAL SPACE<br>
U+3000 IDEOGRAPHIC SPACE<br>
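<br>
The comparison above can be reproduced with a short sketch, assuming the Unicode tables bundled with a recent CPython. The two constant sets are copied from the lists above; their names are illustrative and do not exist in any library:<br>
<pre>
```python
import sys

# The 6 characters re.VERBOSE ignores, and the 5 extra Pattern White Space
# characters, copied from the lists above (illustrative names).
RE_VERBOSE_WS = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20}
PATTERN_WS_EXTRA = {0x85, 0x200E, 0x200F, 0x2028, 0x2029}

# Every code point for which str.isspace() returns True.
isspace_cps = {cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

# Characters str.isspace() accepts that appear in neither list above.
only_isspace = sorted(isspace_cps - RE_VERBOSE_WS - PATTERN_WS_EXTRA)

# Pattern White Space characters that str.isspace() rejects.
missing = sorted((RE_VERBOSE_WS | PATTERN_WS_EXTRA) - isspace_cps)

print("accepted only by str.isspace():")
for cp in only_isspace:
    print(f"  U+{cp:04X}")
print("Pattern White Space rejected by str.isspace():")
for cp in missing:
    print(f"  U+{cp:04X}")
```
</pre>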
<br>
</blockquote></div></div>
str.isspace() appears to match the Unicode "White_Space" property plus those 4 "questionable" codepoints.<div class="HOEnZb"><div class="h5"><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Is it worth extending the set of ignored whitespace characters to "Pattern<br>
White Space"? Would it add any benefit, or just confusion? Should this<br>
depend on the re.ASCII mode? Should the byte b'\x85' be ignorable in<br>
verbose bytes patterns?<br>
<br>
And there is a similar question about the Python parser. If Python uses<br>
the Unicode definition of identifiers, shouldn't it accept non-ASCII<br>
"Pattern White Space" characters as whitespace? There would be technical problems<br>
with supporting this, but are there any benefits?<br>
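<br>
For context, a small sketch of the parser's present behaviour, assuming a recent CPython: non-ASCII identifiers compile, while a non-ASCII space between tokens is rejected. The helper name compiles is hypothetical:<br>
<pre>
```python
def compiles(source):
    """Return True if `source` is accepted by the parser as an expression."""
    try:
        # Compile only; the expression is never evaluated.
        compile(source, "example", "eval")
        return True
    except SyntaxError:
        return False


ascii_ok = compiles("1 + 1")        # ordinary ASCII space
nbsp_ok = compiles("1\u00a0+ 1")    # NO-BREAK SPACE between tokens
ident_ok = compiles("caf\u00e9")    # non-ASCII letter in an identifier

print("ASCII space accepted:", ascii_ok)
print("NO-BREAK SPACE accepted:", nbsp_ok)
print("non-ASCII identifier accepted:", ident_ok)
```
</pre>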
<br>
<br>
<a href="https://perldoc.perl.org/perlre.html" rel="noreferrer" target="_blank">https://perldoc.perl.org/perlre.html</a><br>
<a href="https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax" rel="noreferrer" target="_blank">https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax</a><br>
<a href="https://unicode.org/L2/L2005/05012r-pattern.html" rel="noreferrer" target="_blank">https://unicode.org/L2/L2005/05012r-pattern.html</a><br>
<br>
</blockquote>
</div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">--Guido van Rossum (<a href="http://python.org/~guido" target="_blank">python.org/~guido</a>)</div>
</div>