Regex pattern matching
Hi,

I've been thinking that it would be nice if regex match objects could be deconstructed with pattern matching. For example, a simple .obj parser could use it like this:

    match re.match(r"(v|f) (\d+) (\d+) (\d+)", line):
        case ["v", x, y, z]:
            print("Handle vertex")
        case ["f", a, b, c]:
            print("Handle face")

Sequence patterns would extract groups directly. Mapping patterns could be used to extract named groups, which would be nice for simple parsers/tokenizers:

    match re.match(r"(?P<number>\d+)|(?P<add>\+)|(?P<mul>\*)", line):
        case {"number": str(value)}:
            return Token(type="number", value=int(value))
        case {"add": str()}:
            return Token(type="add")
        case {"mul": str()}:
            return Token(type="mul")

Right now, match objects aren't proper sequence or mapping types, but that doesn't seem too complicated to achieve. If this is something that enough people would consider useful, I'm willing to look into how to implement this.
See https://bugs.python.org/issue46692. It's not so easy to make match objects mappings or sequences because of the len() problem.

Eric

On 2/16/2022 9:46 AM, Valentin Berlier wrote:
> I've been thinking that it would be nice if regex match objects could be deconstructed with pattern matching. [...]
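One reading of the len() problem, sketched under the assumption that it refers to the mismatch between indexing and group counting (the linked issue has the actual discussion): match objects already support indexing, where index 0 is the whole match, so there are two plausible answers for what len() should return.

```python
import re

m = re.match(r"(v|f) (\d+) (\d+) (\d+)", "v 1 12 3")

# Indexing already exists on match objects and starts at the whole match.
whole = m[0]        # same as m.group(0)
first_group = m[1]  # same as m.group(1)

# Candidate 1: the number of capture groups, as seen by m.groups().
n_groups = len(m.groups())

# Candidate 2: the number of valid integer indexes, which also counts group 0.
n_indexable = m.re.groups + 1

# Today len(m) is simply not defined for match objects.
try:
    len(m)
    has_len = True
except TypeError:
    has_len = False
```

With the first choice, m[len(m) - 1] would not be the last group; with the second, a sequence pattern such as ["v", x, y, z] would have to account for the whole match at index 0.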
I see. I guess the ambiguity would stem from trying to force match objects into the sequence protocol even though the custom __getitem__() means that they're essentially a mixed mapping:

    Mapping[int | str, str | None]

If we avoid any sort of "smart" length derived from only mo.groups() or mo.groupdict(), there's nothing stopping match objects from acting as proper mappings. We would need __iter__(), which would simply yield all the available keys, including group 0 and all the named groups, and __len__(), which would return the total number of keys.

My point is that the match object doesn't need to masquerade as something else to be useful; it just needs to implement the protocol to describe the available keys.

    m = re.match(r"(a) (?P<foo>b)(x)?", "a b")
    list(m)  # [0, 1, 2, 3, 'foo']
    dict(m)  # {0: 'a b', 1: 'a', 2: 'b', 3: None, 'foo': 'b'}

This means that pattern matching with mapping patterns would work automatically. The first example I shared would look like this:

    match re.match(r"(v|f) (\d+) (\d+) (\d+)", line):
        case {1: "v", 2: x, 3: y, 4: z}:
            print("Handle vertex")
        case {1: "f", 2: a, 3: b, 4: c}:
            print("Handle face")

The second example would work without any changes:

    match re.match(r"(?P<number>\d+)|(?P<add>\+)|(?P<mul>\*)", line):
        case {"number": str(value)}:
            return Token(type="number", value=int(value))
        case {"add": str()}:
            return Token(type="add")
        case {"mul": str()}:
            return Token(type="mul")
On Wed, 16 Feb 2022 at 14:47, Valentin Berlier <berlier.v@gmail.com> wrote:
> I've been thinking that it would be nice if regex match objects could be deconstructed with pattern matching. [...]
I'm not sure I really see the benefit of this, but if you want to do it, couldn't you just write a wrapper?
    >>> import re
    >>> from collections.abc import Sequence
    >>> class MatchAsSeq(Sequence):
    ...     def __getattr__(self, attr):
    ...         # Delegate everything else (group, groups, ...) to the match.
    ...         return getattr(self.m, attr)
    ...     def __len__(self):
    ...         return len(self.m.groups())
    ...     def __init__(self, m):
    ...         self.m = m
    ...     def __getitem__(self, n):
    ...         # Index 0 of the sequence is group 1; group 0 is skipped.
    ...         return self.group(n+1)
    ...
    >>> line = "v 1 12 3"
    >>> match MatchAsSeq(re.match(r"(v|f) (\d+) (\d+) (\d+)", line)):
    ...     case ["v", x, y, z]:
    ...         print("Handle vertex")
    ...     case ["f", a, b, c]:
    ...         print("Handle face")
    ...
    Handle vertex
Paul
participants (3)
- Eric V. Smith
- Paul Moore
- Valentin Berlier