[Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

Thu Nov 16 12:38:59 EST 2017

Who would benefit from changing this? Let's not change things just because
we can, or because Perl 6 does it.

On Thu, Nov 16, 2017 at 9:21 AM, MRAB <python at mrabarnett.plus.com> wrote:

> On 2017-11-16 10:23, Serhiy Storchaka wrote:
>
>> Currently the re module ignores only 6 ASCII whitespace characters in
>> re.VERBOSE mode:
>>
>>        U+0009 CHARACTER TABULATION
>>        U+000A LINE FEED
>>        U+000B LINE TABULATION
>>        U+000C FORM FEED
>>        U+000D CARRIAGE RETURN
>>        U+0020 SPACE
>>
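
One way to check the list above on a given interpreter is to place each candidate character between two literals in a verbose pattern and see whether the pattern still matches "ab". This is only a sketch: the helper name ignored_in_verbose and the candidate strings are illustrative, not part of the re module, and the probe is only meaningful for characters that have no special meaning in a regex.

```python
import re

# The six ASCII characters listed above, followed by the five extra
# "Pattern White Space" characters that Perl ignores under /x.
ASCII_WS = "\t\n\x0b\x0c\r "
PATTERN_WS_EXTRA = "\x85\u200e\u200f\u2028\u2029"

def ignored_in_verbose(ch):
    """Return True if `ch` is skipped when it appears in a verbose pattern.

    Hypothetical helper: if `ch` is ignored, the pattern reduces to "ab"
    and matches the string "ab"; otherwise `ch` is a literal and the
    match fails.
    """
    return re.fullmatch("a" + ch + "b", "ab", re.VERBOSE) is not None

ignored = [ch for ch in ASCII_WS + PATTERN_WS_EXTRA if ignored_in_verbose(ch)]
print([f"U+{ord(ch):04X}" for ch in ignored])
```

If re behaves as described, only the six ASCII characters should be reported, and U+0085 and the other Pattern White Space characters should be treated as literals.
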
>> Perl ignores the characters that Unicode calls "Pattern White Space" in
>> /x mode. That adds 5 non-ASCII characters:
>>
>>        U+0085 NEXT LINE
>>        U+200E LEFT-TO-RIGHT MARK
>>        U+200F RIGHT-TO-LEFT MARK
>>        U+2028 LINE SEPARATOR
>>        U+2029 PARAGRAPH SEPARATOR
>>
>> The regex module simply ignores characters for which str.isspace()
>> returns True. That adds 20 more characters, including U+001C..U+001F,
>> whose classification as whitespace is questionable, but it does not
>> ignore LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.
>>
>>        U+001C [FILE SEPARATOR]
>>        U+001D [GROUP SEPARATOR]
>>        U+001E [RECORD SEPARATOR]
>>        U+001F [UNIT SEPARATOR]
>>        U+00A0 NO-BREAK SPACE
>>        U+1680 OGHAM SPACE MARK
>>        U+2000 EN QUAD
>>        U+2001 EM QUAD
>>        U+2002 EN SPACE
>>        U+2003 EM SPACE
>>        U+2004 THREE-PER-EM SPACE
>>        U+2005 FOUR-PER-EM SPACE
>>        U+2006 SIX-PER-EM SPACE
>>        U+2007 FIGURE SPACE
>>        U+2008 PUNCTUATION SPACE
>>        U+2009 THIN SPACE
>>        U+200A HAIR SPACE
>>        U+202F NARROW NO-BREAK SPACE
>>        U+205F MEDIUM MATHEMATICAL SPACE
>>        U+3000 IDEOGRAPHIC SPACE
>>
> str.isspace appears to be Unicode "Whitespace" plus those 4
> "questionable" codepoints.
>
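
The set in question can be listed directly by scanning every code point with str.isspace(). A minimal sketch follows; the variable names are illustrative, and the exact result depends on the Unicode database version bundled with the interpreter, so no total count is assumed here.

```python
import sys
import unicodedata

# The six ASCII characters that re.VERBOSE already ignores.
ASCII_VERBOSE_WS = set("\t\n\x0b\x0c\r ")

# Every code point that str.isspace() accepts on this interpreter.
isspace_chars = [
    chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
]

# Characters that isspace() accepts beyond the six above.
extra = [ch for ch in isspace_chars if ch not in ASCII_VERBOSE_WS]

for ch in extra:
    # Control characters have no name, so fall back to a placeholder.
    name = unicodedata.name(ch, "<control>")
    print(f"U+{ord(ch):04X} {name}")
```

LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK should not appear in the output, since str.isspace() returns False for both.
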
>
>> Is it worth extending the set of ignored whitespace characters to
>> "Pattern White Space"? Would it add any benefit? Or add confusion?
>> Should this depend on the re.ASCII mode? Should the byte b'\x85' be
>> ignorable in verbose bytes patterns?
>>
>> And there is a similar question about the Python parser. If Python uses
>> the Unicode definition of identifiers, shouldn't it accept non-ASCII
>> "Pattern White Space" characters as whitespace? There will be technical
>> problems with supporting this, but are there any benefits?
>>
>>
>> https://perldoc.perl.org/perlre.html
>> https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax
>> https://unicode.org/L2/L2005/05012r-pattern.html
>>

-- 
--Guido van Rossum (python.org/~guido)