
Steven D'Aprano writes:
On Sat, Jan 07, 2023 at 10:48:48AM -0800, Peter Ludemann wrote:
You can get almost the same result using pattern matching. For example, your "foo:bar;baz".partition(":", ";") can be done by a well-known matching idiom: re.match(r'([^:]*):([^;]*);(.*)', 'foo:bar;baz').groups()
"Well-known" he says :-)
It *is* well-known to those who know. Just because you don't like regex doesn't mean it's not well-known. I wouldn't use that idiom though; I'd use an explicit character class in most cases I encounter.
I think that the regex solution is also wrong because it requires you to know *exactly* what order the separators are found in the source string.
But that's characteristic of many examples. In "structured" mail headers like Content-Type, you want the separators to come in the order ':', '=', ';'. In a URI scheme with an authority component, you want them in the order '@', ':'. Except that you don't, in either of those examples. In Content-Type, the '=' is optional, and there may be multiple ';'. In authority, the existing ':' is optional, and there's an optional ':' to separate password from username before the '@'. And it gets worse: in the authority case, the username is optional. In the common case of anonymous access, the username is omitted, so user, _, domain = "example.com".partition('@') does the wrong thing!
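To make the anonymous-access pitfall concrete, here is a minimal sketch. The helper names naive_split and split_userinfo are made up for illustration, and only the '@' is handled; the optional ':' separators are ignored here.

```python
def naive_split(authority):
    # str.partition puts the whole string in the FIRST slot when '@' is absent,
    # so a bare host ends up in the "user" position.
    user, _, host = authority.partition("@")
    return user, host


def split_userinfo(authority):
    # str.rpartition puts the whole string in the LAST slot when '@' is absent,
    # so the host lands where we expect it. The separator slot tells us
    # whether there was any userinfo at all.
    userinfo, sep, hostport = authority.rpartition("@")
    return (userinfo if sep else None), hostport


print(naive_split("example.com"))           # host ends up as the "user"
print(split_userinfo("example.com"))        # no userinfo, host intact
print(split_userinfo("alice@example.com"))
```

This only fixes the '@' step; it says nothing yet about which ':' is which.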
If we swap the semi-colon and the colon in the source, but not the pattern, the idiom fails:
>>> re.match(r'([^:]*):([^;]*);(.*)', 'foo;bar:baz').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
So that makes it useless for the case where you want to split on any of a number of separators, but don't know which order they occur in.
Examples where the order of separators doesn't matter? In most of the examples I need, swapping order is a parse error.
You call it "almost the same result" but it is nothing like the result from partition. The separators are lost,
Trivial to fix, just add parens, in the simpler grouping form as a bonus! I'm not asking you to like the resulting regexp better, just pointing out that your dislike of regex is driving the discussion in unprofitable directions.
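A quick sketch of what "just add parens" could look like, assuming plain capturing groups around each separator. The variable names are illustrative only, and the chained str.partition calls are there purely for comparison.

```python
import re

# Capture the separators as well as the fields, so nothing is lost.
with_seps = re.match(r'([^:]*)(:)([^;]*)(;)(.*)', 'foo:bar;baz').groups()

# The same five pieces, built one bite per call with str.partition.
head, sep1, rest = 'foo:bar;baz'.partition(':')
middle, sep2, tail = rest.partition(';')
chained = (head, sep1, middle, sep2, tail)

print(with_seps)
print(chained)
```

Both give the fields and the separators in source order; the regex still insists on ':' before ';'.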
and it splits the string all at once instead of one split per call.
So does the original proposal, that's part of the point of it, I think.

I really don't see any of the variations on the proposal as a particularly valuable addition. It's already easy to screw up your parse with str.partition (the authority example: although you can fix the order problem with '@' by using str.rpartition, the multiple optional ':' mean that whichever r?partition you use, you can get it wrong unless you check the order of '@' and ':', so you have to use a recursive parse, not a sequential parse). But you can write a regex version of authority to give a sequence of tokens rather than a parse, and you convert that into a parse by checking each element of the sequence for None in a deterministic order. I prefer the latter approach (Emacs user since Emacs was programmed in TECO), but as long as you allow me to use regex for character classes and sequences, I can live with restrictions on use of regex in the style guide.

Parsing is hard. Both regex and r?partition are best used as low-level tools for tokenizing, and you're asking for trouble if you try to use them for parsing past a certain point. My breaking point for regex is somewhere around the authority example, but I wouldn't push back if my project's style guide said to break that up. I *would* however often prefer regexp to r?partition because it would allow character classes, and in most of the areas I work with (mail, URIs, encodings) being able to detect lexical errors by using character classes is helpful. And I would prefer "one bite per call" partition to a partition at multiple points. Where I'm being pretty fuzzy, the .split methods are fine.

-- Yet another Steve
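P.S. A minimal sketch of the "regex gives tokens, then check for None" approach to authority. The pattern is a deliberate simplification and an assumption of this sketch, not the RFC 3986 grammar: it ignores IPv6 literals and percent-encoding, and the names AUTHORITY and tokenize_authority are hypothetical.

```python
import re

# Optional "user[:password]@", then host, then optional ":port".
# The character classes reject stray '@', ':' and '/' as lexical errors.
AUTHORITY = re.compile(
    r"(?:(?P<user>[^:@/]*)(?::(?P<password>[^@/]*))?@)?"
    r"(?P<host>[^:@/]+)"
    r"(?::(?P<port>[0-9]*))?"
)


def tokenize_authority(text):
    """Return a dict of tokens; tokens that are absent come back as None."""
    m = AUTHORITY.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid authority: {text!r}")
    return m.groupdict()


for sample in ("example.com", "example.com:80", "alice:secret@example.com:8080"):
    tokens = tokenize_authority(sample)
    # Turn the tokens into a parse by checking for None in a fixed order.
    if tokens["user"] is None:
        print(sample, "-> anonymous access to", tokens["host"])
    else:
        print(sample, "-> user", tokens["user"], "at", tokens["host"])
```

An input such as "exa@mple@com" raises ValueError here, which is the kind of lexical check the character classes buy you.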