[Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

Stephan Houben stephanh42 at gmail.com
Fri Nov 17 11:12:49 EST 2017


I put the actual space characters here so you can see them
in a non-proportional font (which I assume most Python programmers use).

https://gist.github.com/stephanh42/7c1c122154fd3f2cccc6d864233a40d8

The control characters aren't rendered at all (Vim renders them as
^\ ^] ^^ ^_, respectively). Most of the other spaces are rendered exactly
like the normal space.

The only ones which render differently are
U+1680 | | OGHAM SPACE MARK
U+3000 | | IDEOGRAPHIC SPACE

I understand Ogham has recently (since 6th century CE) seen a decline in
popularity.

However, I think Python should totally adopt U+3000 as a new whitespace
character and start promoting it as the One True Way to indent code,
so as to finally end the age-old spaces vs tabs conflict.

[That was supposed to be a joke.]

Stephan



2017-11-17 16:38 GMT+01:00 Victor Stinner <victor.stinner at gmail.com>:

> I don't think that we need more than space (U+0020) and Unix newline
> (U+000A) ;-)
>
> Victor
>
> 2017-11-16 11:23 GMT+01:00 Serhiy Storchaka <storchaka at gmail.com>:
> > Currently the re module ignores only 6 ASCII whitespace characters in
> > re.VERBOSE mode:
> >
> >      U+0009 CHARACTER TABULATION
> >      U+000A LINE FEED
> >      U+000B LINE TABULATION
> >      U+000C FORM FEED
> >      U+000D CARRIAGE RETURN
> >      U+0020 SPACE
> >
> > Perl ignores the characters that Unicode calls "Pattern White Space" in
> > /x mode. That adds 5 non-ASCII characters:
> >
> >      U+0085 NEXT LINE
> >      U+200E LEFT-TO-RIGHT MARK
> >      U+200F RIGHT-TO-LEFT MARK
> >      U+2028 LINE SEPARATOR
> >      U+2029 PARAGRAPH SEPARATOR
> >
> > The regex module just ignores characters for which str.isspace() returns
> > True. It ignores 20 additional characters, including U+001C..U+001F,
> > whose classification as whitespace is questionable, but doesn't ignore
> > LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.
> >
> >      U+001C [FILE SEPARATOR]
> >      U+001D [GROUP SEPARATOR]
> >      U+001E [RECORD SEPARATOR]
> >      U+001F [UNIT SEPARATOR]
> >      U+00A0 NO-BREAK SPACE
> >      U+1680 OGHAM SPACE MARK
> >      U+2000 EN QUAD
> >      U+2001 EM QUAD
> >      U+2002 EN SPACE
> >      U+2003 EM SPACE
> >      U+2004 THREE-PER-EM SPACE
> >      U+2005 FOUR-PER-EM SPACE
> >      U+2006 SIX-PER-EM SPACE
> >      U+2007 FIGURE SPACE
> >      U+2008 PUNCTUATION SPACE
> >      U+2009 THIN SPACE
> >      U+200A HAIR SPACE
> >      U+202F NARROW NO-BREAK SPACE
> >      U+205F MEDIUM MATHEMATICAL SPACE
> >      U+3000 IDEOGRAPHIC SPACE
> >
> > Is it worth extending the set of ignored whitespace characters to
> > "Pattern White Space"? Would it add any benefit? Or add confusion?
> > Should this depend on the re.ASCII mode? Should the byte b'\x85' be
> > ignorable in verbose bytes patterns?
> >
> > And there is a similar question about the Python parser. If Python uses
> > the Unicode definition of identifiers, shouldn't it accept non-ASCII
> > "Pattern White Space" characters as whitespace? There will be technical
> > problems with supporting this, but are there any benefits?
> >
> >
> > https://perldoc.perl.org/perlre.html
> > https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax
> > https://unicode.org/L2/L2005/05012r-pattern.html
> >
> > _______________________________________________
> > Python-ideas mailing list
> > Python-ideas at python.org
> > https://mail.python.org/mailman/listinfo/python-ideas
> > Code of Conduct: http://python.org/psf/codeofconduct/
>
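
The three sets compared in the quoted message can be checked from the
interpreter. What follows is only a sketch: the names RE_VERBOSE_WS,
PATTERN_WS, ISSPACE_WS and verbose_ignores are made up for illustration,
the Pattern White Space set is copied from the list in the message rather
than read from Unicode data, and the str.isspace() result depends on the
Unicode version bundled with the running Python.

```python
import re

# The six ASCII characters the quoted message says re.VERBOSE ignores.
RE_VERBOSE_WS = set("\t\n\v\f\r ")

# Unicode "Pattern White Space" as listed in the quoted message:
# the six above plus five non-ASCII characters (hard-coded assumption).
PATTERN_WS = RE_VERBOSE_WS | set("\x85\u200e\u200f\u2028\u2029")

# Every character str.isspace() accepts in this interpreter.
ISSPACE_WS = {chr(cp) for cp in range(0x110000) if chr(cp).isspace()}


def verbose_ignores(ch):
    """Return True if re.VERBOSE drops `ch` when it appears in a pattern."""
    return re.fullmatch("a" + ch + "b", "ab", re.VERBOSE) is not None


def fmt(chars):
    """Render a set of characters as sorted U+XXXX code points."""
    return [f"U+{cp:04X}" for cp in sorted(ord(ch) for ch in chars)]


candidates = PATTERN_WS | ISSPACE_WS
ignored = {ch for ch in candidates if verbose_ignores(ch)}
extra_isspace = ISSPACE_WS - PATTERN_WS

print("ignored by re.VERBOSE:       ", fmt(ignored))
print("Pattern White Space, literal:", fmt(PATTERN_WS - ignored))
print("isspace() only:              ", fmt(extra_isspace))

# The same check for a bytes pattern containing b'\x85'.
print("b'\\x85' ignored in bytes:    ",
      re.fullmatch(b"a\x85b", b"ab", re.VERBOSE) is not None)
```

If the re module behaves as described in the thread, the first line lists
only the six ASCII characters, so NEXT LINE, the directional marks and all
the non-ASCII spaces stay literal in verbose patterns.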