[Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

Serhiy Storchaka storchaka at gmail.com
Thu Nov 16 05:23:07 EST 2017


Currently the re module ignores only 6 ASCII whitespace characters in
re.VERBOSE mode:

      U+0009 CHARACTER TABULATION
      U+000A LINE FEED
      U+000B LINE TABULATION
      U+000C FORM FEED
      U+000D CARRIAGE RETURN
      U+0020 SPACE
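
A minimal sketch of how the current behaviour can be checked. The helper
name ignored_in_verbose is made up for this example; it only relies on
re.compile() and Pattern.fullmatch() from the standard library.

```python
import re

# The six ASCII whitespace characters listed above.
ASCII_WS = "\t\n\x0b\x0c\r "

def ignored_in_verbose(ch):
    """Return True if `ch`, placed between 'a' and 'b', is skipped in re.VERBOSE mode."""
    pattern = re.compile("a" + ch + "b", re.VERBOSE)
    # If the character is ignored, the pattern is equivalent to "ab".
    return pattern.fullmatch("ab") is not None

results = {f"U+{ord(ch):04X}": ignored_in_verbose(ch) for ch in ASCII_WS}
for code_point, ignored in results.items():
    print(code_point, "ignored" if ignored else "significant")
```

The same helper can be called with any other character, for example
ignored_in_verbose("\x85"), to see how the running Python treats it.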

Perl ignores characters that Unicode calls "Pattern White Space" in
/x mode. In addition to the six above, it ignores 5 non-ASCII characters:

      U+0085 NEXT LINE
      U+200E LEFT-TO-RIGHT MARK
      U+200F RIGHT-TO-LEFT MARK
      U+2028 LINE SEPARATOR
      U+2029 PARAGRAPH SEPARATOR

The regex module simply ignores characters for which str.isspace() returns
True. Beyond the characters listed above, it ignores 20 more, including
U+001C..U+001F (whose classification as whitespace is questionable), but
it doesn't ignore LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK:

      U+001C [FILE SEPARATOR]
      U+001D [GROUP SEPARATOR]
      U+001E [RECORD SEPARATOR]
      U+001F [UNIT SEPARATOR]
      U+00A0 NO-BREAK SPACE
      U+1680 OGHAM SPACE MARK
      U+2000 EN QUAD
      U+2001 EM QUAD
      U+2002 EN SPACE
      U+2003 EM SPACE
      U+2004 THREE-PER-EM SPACE
      U+2005 FOUR-PER-EM SPACE
      U+2006 SIX-PER-EM SPACE
      U+2007 FIGURE SPACE
      U+2008 PUNCTUATION SPACE
      U+2009 THIN SPACE
      U+200A HAIR SPACE
      U+202F NARROW NO-BREAK SPACE
      U+205F MEDIUM MATHEMATICAL SPACE
      U+3000 IDEOGRAPHIC SPACE
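
The difference between the two definitions can be recomputed with a short
sketch like the one below. It assumes only str.isspace() and the
unicodedata module; the names ASCII_WS, PERL_EXTRA, regex_only and
perl_only are made up for the example, and the exact result depends on the
Unicode version of the running Python.

```python
import sys
import unicodedata

# The six ASCII characters that re.VERBOSE ignores today.
ASCII_WS = set("\t\n\x0b\x0c\r ")
# The five non-ASCII "Pattern White Space" characters that Perl adds.
PERL_EXTRA = set("\x85\u200e\u200f\u2028\u2029")

# Every code point for which str.isspace() returns True.
isspace_chars = {chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

# Ignored under the str.isspace() rule but in neither list above.
regex_only = sorted(isspace_chars - ASCII_WS - PERL_EXTRA)
# Pattern White Space characters that str.isspace() rejects.
perl_only = sorted(PERL_EXTRA - isspace_chars)

for ch in regex_only:
    # Control characters have no Unicode name, hence the default.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<control>')}")
print("not isspace():", [f"U+{ord(ch):04X}" for ch in perl_only])
```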

Is it worth extending the set of ignored whitespace characters to "Pattern
White Space"? Would that add any benefit, or only confusion? Should it
depend on re.ASCII mode? Should the byte b'\x85' be ignorable in
verbose bytes patterns?
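
To make the re.ASCII and bytes cases concrete, here is a small probe that
reports how the running Python treats each combination. The helper
is_ignored is hypothetical; the sketch only observes current behaviour and
does not implement any of the proposed changes.

```python
import re

def is_ignored(sep, flags=0):
    """True if `sep` (str or bytes) between 'a' and 'b' is skipped under re.VERBOSE."""
    if isinstance(sep, bytes):
        pattern, subject = b"a" + sep + b"b", b"ab"
    else:
        pattern, subject = "a" + sep + "b", "ab"
    return re.fullmatch(pattern, subject, re.VERBOSE | flags) is not None

cases = {
    "str U+0085": is_ignored("\x85"),
    "str U+0085 with re.ASCII": is_ignored("\x85", re.ASCII),
    "bytes b'\\x85'": is_ignored(b"\x85"),
    "bytes b' '": is_ignored(b" "),
}
for label, ignored in cases.items():
    print(f"{label}: {'ignored' if ignored else 'significant'}")
```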

There is a similar question about the Python parser. If Python uses the
Unicode definition of identifiers, shouldn't it also accept non-ASCII
"Pattern White Space" characters as whitespace? Supporting this would
pose technical problems, but are there any benefits?
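
The parser's current behaviour can be probed in the same spirit. This is
only a sketch: parses is a made-up helper around the built-in compile(),
and the candidate list is a small sample rather than the full Pattern
White Space set.

```python
def parses(source):
    """Return True if `source` compiles as Python code, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True

candidates = {
    "U+0020 SPACE": " ",
    "U+0009 CHARACTER TABULATION": "\t",
    "U+0085 NEXT LINE": "\x85",
    "U+200E LEFT-TO-RIGHT MARK": "\u200e",
    "U+2028 LINE SEPARATOR": "\u2028",
}
for name, ch in candidates.items():
    # Put the candidate where the space in "a = 1" would normally go.
    verdict = "accepted" if parses(f"a{ch}= 1") else "rejected"
    print(f"{name}: {verdict}")

# For contrast: a non-ASCII letter is fine inside an identifier.
print("non-ASCII identifier:", "accepted" if parses("caf\u00e9 = 1") else "rejected")
```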


https://perldoc.perl.org/perlre.html
https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax
https://unicode.org/L2/L2005/05012r-pattern.html


