Re: Multiple arguments to str.partition and bytes.partition
You can get almost the same result using pattern matching. For example, your "foo:bar;baz".partition(":", ";") can be done by a well-known matching idiom: re.match(r'([^:]*):([^;]*);(.*)', 'foo:bar;baz').groups()
On Sat, Jan 07, 2023 at 10:48:48AM -0800, Peter Ludemann wrote:
You can get almost the same result using pattern matching. For example, your "foo:bar;baz".partition(":", ";") can be done by a well-known matching idiom: re.match(r'([^:]*):([^;]*);(.*)', 'foo:bar;baz').groups()
"Well-known" he says :-) I think that is a perfect example of the ability to use regexes for obfuscation. It gets worse if you want to partition on a regex metacharacter like '.' I think that the regex solution is also wrong because it requires you to know *exactly* what order the separators are found in the source string. If we swap the semi-colon and the colon in the source, but not the pattern, the idiom fails: >>> re.match(r'([^:]*):([^;]*);(.*)', 'foo;bar:baz').groups() Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'NoneType' object has no attribute 'groups' So that makes it useless for the case where you want to split of any of a number of separators, but don't know which order they occur in. You call it "almost the same result" but it is nothing like the result from partition. The separators are lost, and it splits the string all at once instead of one split per call. I think this would be a closer match: ```
re.split(r'[:;]', 'foo:bar;baz', maxsplit=1) ['foo', 'bar;baz']
but even there we lose the information of which separator was
partitioned on.
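For what it's worth, re.split will keep the separator if the pattern uses a capturing group, though that is arguably one more piece of regex arcana. A quick sketch:

```python
import re

# A capturing group makes re.split include the matched separator in
# the result, recovering the (head, sep, tail) shape of partition:
result = re.split(r'([:;])', 'foo:bar;baz', maxsplit=1)
print(result)  # ['foo', ':', 'bar;baz']
```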
--
Steve
Steven D'Aprano writes:
On Sat, Jan 07, 2023 at 10:48:48AM -0800, Peter Ludemann wrote:
You can get almost the same result using pattern matching. For example, your "foo:bar;baz".partition(":", ";") can be done by a well-known matching idiom: re.match(r'([^:]*):([^;]*);(.*)', 'foo:bar;baz').groups()
"Well-known" he says :-)
It *is* well-known to those who know. Just because you don't like regex doesn't mean it's not well-known. I wouldn't use that idiom though; I'd use an explicit character class in most cases I encounter.
I think that the regex solution is also wrong because it requires you to know *exactly* what order the separators are found in the source string.
But that's characteristic of many examples. In "structured" mail headers like Content-Type, you want the separators to come in the order ':', '=', ';'. In a URI scheme with an authority component, you want them in the order '@', ':'.

Except that you don't, in both those examples. In Content-Type, the '=' is optional, and there may be multiple ';'. In authority, the existing ':' is optional, and there's an optional ':' to separate password from username before the '@'.

And it gets worse: in the authority case, the username is optional. In the common case of anonymous access, the username is omitted, so

    user, _, domain = "example.com".partition('@')

does the wrong thing!
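A minimal interpreter sketch of that failure mode, contrasted with str.rpartition (which assigns the remainder to the last slot instead):

```python
# With no '@' present, str.partition puts the whole input in the
# *first* slot, so the host ends up bound to `user`:
user, _, domain = "example.com".partition('@')
print((user, domain))  # ('example.com', '')

# str.rpartition puts the remainder in the *last* slot, which is the
# right default when the userinfo component is optional:
user, _, domain = "example.com".rpartition('@')
print((user, domain))  # ('', 'example.com')
```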
If we swap the semi-colon and the colon in the source, but not the pattern, the idiom fails:
    >>> re.match(r'([^:]*):([^;]*);(.*)', 'foo;bar:baz').groups()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'NoneType' object has no attribute 'groups'
So that makes it useless for the case where you want to split on any of a number of separators, but don't know which order they occur in.
Examples where the order of separators doesn't matter? In most of the examples I need, swapping order is a parse error.
You call it "almost the same result" but it is nothing like the result from partition. The separators are lost,
Trivial to fix, just add parens, in the simpler grouping form as a bonus! I'm not asking you to like the resulting regexp better, just pointing out that your dislike of regex is driving the discussion in unprofitable directions.
and it splits the string all at once instead of one split per call.
So does the original proposal, that's part of the point of it, I think.

I really don't see any of the variations on the proposal as a particularly valuable addition. It's already easy to screw up your parse with str.partition (the authority example: although you can fix the order problem with '@' by using str.rpartition, the multiple optional ':' mean that whichever r?partition you use, you can get it wrong unless you check the order of '@' and ':', so you have to use a recursive parse, not a sequential parse).

But you can write a regex version of authority to give a sequence of tokens rather than a parse, and you convert that into a parse by checking each element of the sequence for None in a deterministic order. I prefer the latter approach (Emacs user since Emacs was programmed in TECO), but as long as you allow me to use regex for character classes and sequences, I can live with restrictions on use of regex in the style guide.

Parsing is hard. Both regex and r?partition are best used as low-level tools for tokenizing, and you're asking for trouble if you try to use them for parsing past a certain point. My breaking point for regex is somewhere around the authority example, but I wouldn't push back if my project's style guide said to break that up. I *would* however often prefer regexp to r?partition because it would allow character classes, and in most of the areas I work with (mail, URIs, encodings) being able to detect lexical errors by using character classes is helpful. And I would prefer "one bite per call" partition to a partition at multiple points. Where I'm being pretty fuzzy, the .split methods are fine.

--
Yet another Steve
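The "regex gives a sequence of tokens, the caller checks each slot for None" approach can be sketched for a simplified authority component (the AUTHORITY name and the exact pattern here are illustrative only, and ignore the optional password subtlety):

```python
import re

# Simplified authority grammar: [userinfo "@"] host [":" port].
# Groups that did not participate in the match come back as None,
# so the caller inspects the slots in a deterministic order.
AUTHORITY = re.compile(r'(?:([^@]*)@)?([^:@]*)(?::(.*))?$')

print(AUTHORITY.match('alice@example.com:8080').groups())
# ('alice', 'example.com', '8080')
print(AUTHORITY.match('example.com').groups())
# (None, 'example.com', None)
```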
On Sun, Jan 08, 2023 at 05:30:30PM +0900, Stephen J. Turnbull wrote:
Steven D'Aprano writes:
On Sat, Jan 07, 2023 at 10:48:48AM -0800, Peter Ludemann wrote:
You can get almost the same result using pattern matching. For example, your "foo:bar;baz".partition(":", ";") can be done by a well-known matching idiom: re.match(r'([^:]*):([^;]*);(.*)', 'foo:bar;baz').groups()
"Well-known" he says :-)
It *is* well-known to those who know. Just because you don't like regex doesn't mean it's not well-known.
I like regexes plenty, for what they are good for. But my *liking* them or not is irrelevant as to whether this example is "well-known" or not. I'm not the heaviest regex user in the world, but I've used my share, and I've never seen this particular line noise before. (Hey, I like Forth. Sometimes line noise is great.)

I mean, if all you are doing is splitting the source by some separators regardless of order, surely this does the same job and is *vastly* more obvious?
    >>> re.split(r'[:;]', 'foo:bar;baz')
    ['foo', 'bar', 'baz']
If the order matters:
    >>> re.match('(.*):(.*);(.*)', 'foo:bar;baz').groups()
    ('foo', 'bar', 'baz')
Or use non-greedy wildcards if you need them:
    >>> re.match('(.*?):(.*?);(.*)', 'foo:b:ar;ba;z').groups()
    ('foo', 'b:ar', 'ba;z')
I think that the regex solution is also wrong because it requires you to know *exactly* what order the separators are found in the source string.
But that's characteristic of many examples.
Great. Then for *those* structured examples you can happily write your regex and put the separators in the order you expect.

But I'm talking about *unstructured* examples where you don't know the order of the separators, you want to split on whichever one comes first regardless of the order, and you need to know which separator that was.

[...]
Examples where the order of separators doesn't matter? In most of the examples I need, swapping order is a parse error.
Okay, then you *mostly* don't need this.
and it splits the string all at once instead of one split per call.
So does the original proposal, that's part of the point of it, I think.
str.partition does *one* three-way split, into (head, sep, tail). If you want to continue to partition the tail, you have to call it again. To me, that fixed "one bite per call" design is fundamental to partition(). If we wanted an arbitrary number of splits we'd use, um, split() :-)

Of course we can debate the pros and cons of each, that's what this thread is for.
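The "one bite per call" pattern in a loop, for concreteness:

```python
# Repeatedly partition the tail, one three-way split per call:
s = 'a:b:c:d'
parts = []
while True:
    head, sep, s = s.partition(':')
    parts.append(head)
    if not sep:        # empty sep means no separator was found
        break
print(parts)  # ['a', 'b', 'c', 'd']
```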
Parsing is hard. Both regex and r?partition are best used as low-level tools for tokenizing, and you're asking for trouble if you try to use them for parsing past a certain point.
Right! I agree! And that is why I want partition to accept multiple separators and split on the first one found. I find myself needing to do that, well, not "all the time" by any means, but often enough that it's an itch I want scratched.
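For concreteness, the behaviour being requested could be sketched in pure Python (multi_partition is a hypothetical name, not a proposed spelling):

```python
def multi_partition(s, *seps):
    """Like str.partition, but split at whichever separator occurs
    earliest in s, and report which one it was."""
    best = None
    for sep in seps:
        i = s.find(sep)
        if i != -1 and (best is None or i < best[0]):
            best = (i, sep)
    if best is None:
        return (s, '', '')   # mirror str.partition's not-found case
    i, sep = best
    return (s[:i], sep, s[i + len(sep):])

print(multi_partition('foo:bar;baz', ':', ';'))  # ('foo', ':', 'bar;baz')
print(multi_partition('foo;bar:baz', ':', ';'))  # ('foo', ';', 'bar:baz')
```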
My breaking point for regex is somewhere around the authority example,
Heh, I've written much more complicated examples. It was kinda fun, until I came back to it a month later and couldn't understand what the hell it did! :-)
but I wouldn't push back if my project's style guide said to break that up. I *would* however often prefer regexp to r?partition because it would allow character classes, and in most of the areas I work with (mail, URIs, encodings) being able to detect lexical errors by using character classes is helpful.
I'm not sure I quite understand you there, but if I do, I would prefer to split the string and then validate the head and tail afterwards, rather than just have the regex fail.
And I would prefer "one bite per call" partition to a partition at multiple points. Where I'm being pretty fuzzy, the .split methods are fine.
I think we agree here.

--
Steve
Steven D'Aprano writes:
I mean, if all you are doing is splitting the source by some separators regardless of order, surely this does the same job and is *vastly* more obvious?
    >>> re.split(r'[:;]', 'foo:bar;baz')
    ['foo', 'bar', 'baz']
"Obvious" yes, but it's also easy to invest that call with semantics (eg, "just three segments because that's the allowed syntax") that it doesn't possess. You haven't stated how many elements it should be split into, nor whether the separator characters are permitted in components, nor whether this component is the whole input and this regexp defines the whole syntax. The point of the "well-known idiom" is to specify most of that (and it doesn't take much much more to specify all of it, specifying "no separators in components" is the most space-consuming part of the expression!) Your other alternatives have the same potential issues.
But that's characteristic of many examples.
Great. Then for *those* structured examples you can happily write your regex and put the separators in the order you expect.
But I'm talking about *unstructured* examples where you don't know the order of the separators, you want to split on whichever one comes first regardless of the order, and you need to know which separator that was.
That's easy enough to do with a (relatively unknown to some ;-) regular expression:

    re.match("([^;:]*)([;:])(.*)", source)

The question is whether the need is frequent enough and that's hard enough to understand / ugly enough to warrant another method or an incompatible extension to str.partition (and str.rpartition).[1]
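Trying it out: the expression reports which separator it hit, and simply fails to match (returns None) when neither is present, so the caller still has to handle that case:

```python
import re

m = re.match(r"([^;:]*)([;:])(.*)", "foo;bar:baz")
print(m.groups())  # ('foo', ';', 'bar:baz')

# Unlike str.partition, a string with no separator does not match:
print(re.match(r"([^;:]*)([;:])(.*)", "foobar"))  # None
```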
Examples where the order of separators doesn't matter? In most of the examples I need, swapping order is a parse error.
Okay, then you *mostly* don't need this.
I already knew that. Without real examples, I can't judge whether I'm pro-status quo or pro-serving-the-nonuniversal-but-still-useful-case.
str.partition does *one* three way split, into (head, sep, tail). If you want to continue to partition the tail, you have to call it again.
I'm much more favorable to proposals where str.partition and str.rpartition split at *one* point, but the OP seemed intended to do more work (but not arbitrary amounts!) per call.
I'm not sure I quite understand you there, but if I do, I would prefer to split the string and then validate the head and tail afterwards, rather than just have the regex fail.
For me, often that depends on how hard I'm willing to work to support users. If the only user is myself, that's very often zero. In the case of the "well-known idiom", the only ways the regexp can fail involve the wrong number of separators. I'd be willing to impose that burden on users with a "wrong number of separators" message. Another case is where I want an efficient parser for the vast majority of conformant cases and am willing to do redundant work for the error cases.

Footnotes:
[1] Here "incompatible" means that people writing code that must support previous versions of Python can't use it.
On Sun, 8 Jan 2023 at 08:32, Stephen J. Turnbull wrote:
Steven D'Aprano writes:
On Sat, Jan 07, 2023 at 10:48:48AM -0800, Peter Ludemann wrote:
You can get almost the same result using pattern matching. For example, your "foo:bar;baz".partition(":", ";") can be done by a well-known matching idiom: re.match(r'([^:]*):([^;]*);(.*)', 'foo:bar;baz').groups()
I think that the regex solution is also wrong because it requires you to know *exactly* what order the separators are found in the source string.
But that's characteristic of many examples. In "structured" mail headers like Content-Type, you want the separators to come in the order ':', '=', ';'. In a URI scheme with an authority component, you want them in the order '@', ':'.
+1 (while also recognising the caveats you mention subsequently)
Except that you don't, in both those examples. In Content-Type, the '=' is optional, and there may be multiple ';'. In authority, the existing ':' is optional, and there's an optional ':' to separate password from username before the '@'.
Trying to avoid the usual discussions about permissive parsing / supporting various implementations in-the-wild: long-term, the least ambiguous and most computationally-efficient environment would probably want to reduce special cases like that? (both in-data and in-code)
    user, _, domain = "example.com".partition('@')
does the wrong thing!
Yep - it's important to choose partition arguments (I'm mostly-resisting the temptation to call them a 'pattern') that are appropriate for the input. Structural pattern matching _seems_ like it could correspond here, in terms of selecting appropriate arguments -- but it is, as I understand it, limited to at-most-one wildcard pattern per match (by sensible design).
I would prefer "one bite per call" partition to a partition at multiple points.
That does seem clearer - and clearer is, generally, probably better. I suppose an analysis (that I don't have the ability to perform easily) could be to determine how many regular expression codesites could be migrated compatibly and beneficially by using multiple-partition-arguments.
James Addison via Python-ideas writes:
On Sun, 8 Jan 2023 at 08:32, Stephen J. Turnbull wrote:
Trying to avoid the usual discussions about permissive parsing / supporting various implementations in-the-wild: long-term, the least ambiguous and most computationally-efficient environment would probably want to reduce special cases like that? (both in-data and in-code)
That's not very human-friendly, though. Push that to extremes and you get XML. "Nobody expects the XML Validators!"
Structural pattern matching _seems_ like it could correspond here, in terms of selecting appropriate arguments -- but it is, as I understand it, limited to at-most-one wildcard pattern per match (by sensible design).
If I understand what you mean by "structural pattern matching", that seems more appropriate to parsing already tokenized input.
I suppose an analysis (that I don't have the ability to perform easily) could be to determine how many regular expression codesites could be migrated compatibly and beneficially by using multiple-partition-arguments.
My guess is that for re.match (or re.search) it would be relatively few. People tend to reach for regular expression matching when they have repetition or alternatives that they want to capture in a single expression, and that is generally not going to be easy to capture with str.partition.

But I bet that *many* calls to re.split take regular expressions of the form f'[{separators}]' which would be easy enough to search for. That's where you could reduce the number of regexps.

Steve
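A crude approximation of that search over source text (the CLASS_SPLIT name and the heuristic are mine; it would miss patterns built from variables or f-strings):

```python
import re

# Heuristic: find re.split calls whose entire pattern is a single
# character class, i.e. candidates for a multi-separator split.
CLASS_SPLIT = re.compile(r"re\.split\(\s*r?['\"]\[[^\]]+\]['\"]")

source = """
parts = re.split(r'[:;]', line)
words = re.split(r'\\s+', text)
"""
hits = CLASS_SPLIT.findall(source)
print(hits)  # only the character-class call matches
```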
participants (4)
- James Addison
- Peter Ludemann
- Stephen J. Turnbull
- Steven D'Aprano