On 2017-11-16 10:23, Serhiy Storchaka wrote:
Currently the re module ignores only 6 ASCII whitespace characters in re.VERBOSE mode:

    U+0009 CHARACTER TABULATION
    U+000A LINE FEED
    U+000B LINE TABULATION
    U+000C FORM FEED
    U+000D CARRIAGE RETURN
    U+0020 SPACE
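As a rough illustration of the behaviour described above, the following sketch compiles two verbose patterns. It assumes current CPython behaviour for the re module; the names ASCII_WS, ascii_pat and nel_pat are made up for this example.

```python
import re

# The six ASCII characters that re.VERBOSE skips, per the list above.
ASCII_WS = "\t\n\x0b\x0c\r "

# All six sit between 'a' and 'b'; in verbose mode they are dropped,
# so the compiled pattern should behave like plain 'ab'.
ascii_pat = re.compile("a" + ASCII_WS + "b", re.VERBOSE)
ascii_matches_ab = ascii_pat.fullmatch("ab") is not None

# U+0085 NEXT LINE is outside that set, so it stays a literal character
# in the pattern and the subject string has to contain it as well.
nel_pat = re.compile("a\x85b", re.VERBOSE)
nel_matches_ab = nel_pat.fullmatch("ab") is not None
nel_matches_literal = nel_pat.fullmatch("a\x85b") is not None

print(ascii_matches_ab, nel_matches_ab, nel_matches_literal)
```

If the set of ignored characters were extended, the second pattern would start matching "ab", which is the kind of behaviour change under discussion.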
In /x mode, Perl ignores the characters that Unicode calls "Pattern White Space". That adds 5 non-ASCII characters:

    U+0085 NEXT LINE
    U+200E LEFT-TO-RIGHT MARK
    U+200F RIGHT-TO-LEFT MARK
    U+2028 LINE SEPARATOR
    U+2029 PARAGRAPH SEPARATOR
The regex module just ignores characters for which str.isspace() returns True. That covers 20 additional characters, including U+001C..U+001F, whose classification as whitespace is questionable, but it doesn't cover LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK:

    U+001C [FILE SEPARATOR]
    U+001D [GROUP SEPARATOR]
    U+001E [RECORD SEPARATOR]
    U+001F [UNIT SEPARATOR]
    U+00A0 NO-BREAK SPACE
    U+1680 OGHAM SPACE MARK
    U+2000 EN QUAD
    U+2001 EM QUAD
    U+2002 EN SPACE
    U+2003 EM SPACE
    U+2004 THREE-PER-EM SPACE
    U+2005 FOUR-PER-EM SPACE
    U+2006 SIX-PER-EM SPACE
    U+2007 FIGURE SPACE
    U+2008 PUNCTUATION SPACE
    U+2009 THIN SPACE
    U+200A HAIR SPACE
    U+202F NARROW NO-BREAK SPACE
    U+205F MEDIUM MATHEMATICAL SPACE
    U+3000 IDEOGRAPHIC SPACE
str.isspace appears to be Unicode "Whitespace" plus those 4 "questionable" codepoints.
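One way to check this is to compare the two sets directly. The sketch below hard-codes Pattern White Space from the 11 characters listed in the message and asks the running interpreter which code points str.isspace() accepts, so the second list depends on the Unicode database bundled with that interpreter. The names PATTERN_WHITE_SPACE, ISSPACE, only_pattern and only_isspace are made up for this example.

```python
import sys

# Unicode "Pattern White Space" as listed in the message:
# the 6 ASCII characters plus 5 non-ASCII ones.
PATTERN_WHITE_SPACE = frozenset(
    "\t\n\x0b\x0c\r \x85\u200e\u200f\u2028\u2029"
)

# Every code point that str.isspace() accepts in this interpreter.
ISSPACE = frozenset(
    chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
)

# Characters that only one of the two definitions treats as whitespace.
only_pattern = sorted(PATTERN_WHITE_SPACE - ISSPACE)
only_isspace = sorted(ISSPACE - PATTERN_WHITE_SPACE)

print("Pattern White Space only:", [f"U+{ord(c):04X}" for c in only_pattern])
print("str.isspace() only:      ", [f"U+{ord(c):04X}" for c in only_isspace])
```

The first list should contain just the two directional marks, and the second should include the four separator controls U+001C..U+001F alongside the Unicode space characters.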
Is it worth extending the set of ignored whitespace characters to "Pattern White Space"? Would it add any benefit? Or add confusion? Should this depend on the re.ASCII mode? Should the byte b'\x85' be ignorable in verbose bytes patterns?
And there is a similar question about the Python parser. If Python uses the Unicode definition of identifiers, shouldn't it accept non-ASCII "Pattern White Space" characters as whitespace? There will be technical problems with supporting this, but are there any benefits?
https://perldoc.perl.org/perlre.html
https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax
https://unicode.org/L2/L2005/05012r-pattern.html