[Python-ideas] Add regex pattern literal p""

Sat Dec 29 01:56:19 EST 2018

On Sat, Dec 29, 2018 at 12:30 AM Alexander Heger <python at 2sn.net> wrote:

> for regular strings one can write
>
> "aaa" + "bbb"
>
> which also works for f-strings, r-strings, etc.; in regular expressions,
> there is, e.g., parameter counting and references to numbered matches.  How
> would that be dealt with in a compound p-string?  Either it would have to
> be re-compiled or not; either way could lead to unexpected results:
>
> p"(\d)\1" + p"(\s)\1"
>
> or
>
> p"^(\w)" + p"^(\d)"
>
> regular strings can be added, but the results of p-strings could not - well,
> they are not strings.
>

Isn't this a feature, not a bug, of encouraging patterns to be specified as
literals? Addition of patterns would raise an error (as is currently the
case for addition of compiled patterns in the re and regex modules).
Currently, I find it easiest to use r-strings for patterns and call
re.search() etc. without precompiling them, which means that I could
accidentally concatenate two patterns together that would silently produce
an unmatchable pattern. Using p-literals for most patterns would mean I
have to be explicit in the exceptional case where I do want to assemble a
pattern from multiple parts:

FIRSTNAME = p"[A-Z][-A-Za-z']+"
LASTNAME = p"[-A-Za-z']([-A-Za-z' ]+[-A-Za-z'])?"
FULLNAME = FIRSTNAME + p' ' + LASTNAME # error

FIRSTNAME = r"[A-Z][-A-Za-z']+"
LASTNAME = r"[-A-Za-z']([-A-Za-z' ]+[-A-Za-z'])?"
FULLNAME = re.compile(FIRSTNAME + ' ' + LASTNAME) # success
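
For comparison, here is a rough sketch of how this plays out with the re
module as it exists today. Since there is no p"" literal, re.compile() stands
in for it; the test strings and the variable names holding the results are
only illustrative.

```python
import re

# re.compile() stands in for the proposed p"" literal, which does not exist.
FIRSTNAME = re.compile(r"[A-Z][-A-Za-z']+")
LASTNAME = re.compile(r"[-A-Za-z']([-A-Za-z' ]+[-A-Za-z'])?")

# Compiled patterns do not define "+", so accidental addition fails loudly.
try:
    FIRSTNAME + LASTNAME
    add_error = None
except TypeError as exc:
    add_error = exc

# Deliberate assembly has to go through the .pattern source strings.
FULLNAME = re.compile(FIRSTNAME.pattern + " " + LASTNAME.pattern)
full_match = FULLNAME.fullmatch("Ada Lovelace")

# Plain strings concatenate silently: this compiles, but the second "^"
# can never match after a character has been consumed.
anchored = re.compile(r"^(\w)" + r"^(\d)")
anchored_match = anchored.search("a1")

# Likewise, after concatenation the second \1 refers to the digit group,
# not to the whitespace group it was written next to.
backref = re.compile(r"(\d)\1" + r"(\s)\1")
backref_match = backref.search("11  ")
```

The last two patterns compile without complaint, which is exactly the kind of
silent mistake that an error on pattern addition would rule out.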

Another potential advantage is that an ill-formed p-literal (such as a
mismatched parenthesis) would be caught immediately, rather than when it is
first used. This could pay off, for example, if I am defining a data
structure with a bunch of regexes that would get used for different input.
(But there may be performance tradeoffs here.)
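
A small sketch of the difference, again using re.compile() as a stand-in for
the hypothetical p-literal (BAD and lookup are made-up names). Note that
re.compile() at module level still runs at import time, whereas a real literal
could presumably be checked when the source is compiled.

```python
import re

BAD = r"(\d+"  # ill-formed: missing closing parenthesis


def lookup(text):
    # With a plain string, the mistake only surfaces when this is called.
    return re.search(BAD, text)


# Compiling where the pattern is defined surfaces the error right away.
try:
    re.compile(BAD)
    compile_error = None
except re.error as exc:
    compile_error = exc
```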

> This brings me to the point that
> the key difference is that f- and r-strings actually return strings,
> whereas p-strings would return a different kind of object.
> That would certainly seem very confusing to novices - and also for the
> language standard as a whole.
>
>
The b prefix produces a bytes literal. Is a bytes object a kind of string,
more so than a regex pattern is? I could see an argument that bytes is a
particular encoding of sequential character data, whereas a regex pattern
represents a string *language*, i.e. an abstraction over string data.
But...this distinction starts to feel very theoretical rather than
practical. If novices are expected to read code with regular expressions in
it, why would they have trouble understanding that the "p" prefix means
"pattern"?

As someone who works with text a lot, I think there's a decent
practicality-beats-purity argument in favor of p-literals, which would make
regex operations more easily accessible and prevent patterns from being
mixed up with string data.

A potential downside, though, is that it will be tempting to introduce
flags as prefixes, too. Do we want to go down the road of pui"my
Unicode-compatible case-insensitive pattern"?
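
For reference, the re module already offers two ways to attach flags without
any prefix: as an argument to re.compile(), or inline in the pattern itself.
A minimal sketch (the pattern and variable names are just examples):

```python
import re

# Flag passed as an argument to re.compile().
by_argument = re.compile(r"python", re.IGNORECASE)

# The same flag embedded at the start of the pattern as an inline flag.
inline = re.compile(r"(?i)python")

argument_match = by_argument.search("PYTHON")
inline_match = inline.search("Python-ideas")
```

If p-literals kept supporting inline flags like this, prefix letters for flags
might not be needed at all.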

Nathan

> -Alexander