[Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

MRAB python at mrabarnett.plus.com
Thu Nov 16 12:21:51 EST 2017


On 2017-11-16 10:23, Serhiy Storchaka wrote:
> Currently the re module ignores only 6 ASCII whitespace characters in
> re.VERBOSE mode:
> 
>        U+0009 CHARACTER TABULATION
>        U+000A LINE FEED
>        U+000B LINE TABULATION
>        U+000C FORM FEED
>        U+000D CARRIAGE RETURN
>        U+0020 SPACE
> 
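For anyone who wants to check the current behaviour, here is a minimal sketch using the stdlib re module. The helper name ignored_in_verbose is made up for illustration, and it assumes that a character counts as "ignored" if the verbose pattern a<ch>b still matches the plain string "ab".

```python
import re

def ignored_in_verbose(ch):
    """Return True if `ch` is skipped as whitespace in re.VERBOSE mode.

    Hypothetical helper: if `ch` is ignored, the pattern "a" + ch + "b"
    collapses to "ab" and therefore matches the string "ab".
    """
    return re.fullmatch("a" + ch + "b", "ab", re.VERBOSE) is not None

# The 6 ASCII whitespace characters listed above.
ascii_ws = "\t\n\x0b\x0c\r "
# The 5 extra "Pattern White Space" characters Perl ignores in /x mode.
pattern_ws_extra = "\x85\u200e\u200f\u2028\u2029"

for ch in ascii_ws + pattern_ws_extra:
    print("U+%04X ignored: %s" % (ord(ch), ignored_in_verbose(ch)))
```

With the re module as described above, the first six should report True and the remaining five False.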
> Perl ignores the characters that Unicode calls "Pattern White Space" in
> /x mode. That adds 5 non-ASCII characters:
> 
>        U+0085 NEXT LINE
>        U+200E LEFT-TO-RIGHT MARK
>        U+200F RIGHT-TO-LEFT MARK
>        U+2028 LINE SEPARATOR
>        U+2029 PARAGRAPH SEPARATOR
> 
> The regex module just ignores characters for which str.isspace() returns
> True. It ignores 20 more characters (16 of them non-ASCII), including
> U+001C..U+001F, whose classification as whitespace is questionable, but
> it doesn't ignore LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.
> 
>        U+001C [FILE SEPARATOR]
>        U+001D [GROUP SEPARATOR]
>        U+001E [RECORD SEPARATOR]
>        U+001F [UNIT SEPARATOR]
>        U+00A0 NO-BREAK SPACE
>        U+1680 OGHAM SPACE MARK
>        U+2000 EN QUAD
>        U+2001 EM QUAD
>        U+2002 EN SPACE
>        U+2003 EM SPACE
>        U+2004 THREE-PER-EM SPACE
>        U+2005 FOUR-PER-EM SPACE
>        U+2006 SIX-PER-EM SPACE
>        U+2007 FIGURE SPACE
>        U+2008 PUNCTUATION SPACE
>        U+2009 THIN SPACE
>        U+200A HAIR SPACE
>        U+202F NARROW NO-BREAK SPACE
>        U+205F MEDIUM MATHEMATICAL SPACE
>        U+3000 IDEOGRAPHIC SPACE
> 
str.isspace() appears to match the Unicode "White_Space" property plus
those 4 "questionable" codepoints (U+001C..U+001F).
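
A rough way to check that claim is sketched below. The White_Space set is hardcoded from the Unicode property list, since the stdlib unicodedata module does not expose that property, so treat the table as an assumption that could drift with future Unicode versions.

```python
import sys

# Code points with the Unicode White_Space property, hardcoded here
# (assumption: copied from the Unicode property list, not queried).
WHITE_SPACE = (
    set(range(0x0009, 0x000E))      # TAB, LF, VT, FF, CR
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))    # EN QUAD .. HAIR SPACE
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)

# Every code point for which str.isspace() returns True.
isspace_set = {cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

extra = sorted(isspace_set - WHITE_SPACE)    # isspace() but not White_Space
missing = sorted(WHITE_SPACE - isspace_set)  # White_Space but not isspace()

print("extra:  ", ["U+%04X" % cp for cp in extra])
print("missing:", ["U+%04X" % cp for cp in missing])
```

If the observation holds, "extra" should contain only U+001C..U+001F and "missing" should be empty.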

> Is it worth extending the set of ignored whitespace characters to
> "Pattern White Space"? Would it add any benefit, or just confusion?
> Should this depend on the re.ASCII mode? Should the byte b'\x85' be
> ignorable in verbose bytes patterns?
> 
> And there is a similar question about the Python parser. If Python uses
> the Unicode definition of identifiers, shouldn't it accept non-ASCII
> "Pattern White Space" characters as whitespace? There would be technical
> problems with supporting this, but are there any benefits?
> 
> 
> https://perldoc.perl.org/perlre.html
> https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax
> https://unicode.org/L2/L2005/05012r-pattern.html
> 

