Re: Multiple arguments to str.partition and bytes.partition
You can get almost the same result using pattern matching. For example, your "foo:bar;baz".partition(":", ";") can be done by a well-known matching idiom:

    re.match(r'([^:]*):([^;]*);(.*)', 'foo:bar;baz').groups()
On Sat, Jan 07, 2023 at 10:48:48AM -0800, Peter Ludemann wrote:
"Well-known" he says :-) I think that is a perfect example of the ability to use regexes for obfuscation. It gets worse if you want to partition on a regex metacharacter like '.'.

I think that the regex solution is also wrong because it requires you to know *exactly* what order the separators are found in the source string. If we swap the semi-colon and the colon in the source, but not the pattern, the idiom fails:

>>> re.match(r'([^:]*):([^;]*);(.*)', 'foo;bar:baz').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'

So that makes it useless for the case where you want to split on any of a number of separators, but don't know which order they occur in.

You call it "almost the same result" but it is nothing like the result from partition. The separators are lost, and it splits the string all at once instead of one split per call. I think this would be a closer match:

>>> re.split(r'[:;]', 'foo:bar;baz', maxsplit=1)
['foo', 'bar;baz']

but even there we lose the information of which separator was partitioned on.
--
Steve
Steven D'Aprano writes:
It *is* well-known to those who know. Just because you don't like regex doesn't mean it's not well-known. I wouldn't use that idiom though; I'd use an explicit character class in most cases I encounter.
But that's characteristic of many examples. In "structured" mail headers like Content-Type, you want the separators to come in the order ':', '=', ';'. In a URI scheme with an authority component, you want them in the order '@', ':'. Except that you don't, in both those examples. In Content-Type, the '=' is optional, and there may be multiple ';'. In authority, the existing ':' is optional, and there's an optional ':' to separate password from username before the '@'. And it gets worse: in the authority case, the username is optional. In the common case of anonymous access, the username is omitted, so

    user, _, domain = "example.com".partition('@')

does the wrong thing!
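To make that failure concrete, here is a small illustration of what str.partition and str.rpartition return when the '@' is missing. The sample authority strings are made up for this sketch.

```python
# Illustration only: the sample authority strings are made up.
with_user = "alice@example.com"
anonymous = "example.com"

# With a username present, partition('@') gives the expected pieces.
user, _, domain = with_user.partition("@")
assert (user, domain) == ("alice", "example.com")

# Without '@', partition returns (whole_string, '', ''), so the host
# lands in `user` and `domain` is empty.
user, sep, domain = anonymous.partition("@")
assert (user, sep, domain) == ("example.com", "", "")

# rpartition returns ('', '', whole_string) when the separator is
# missing, which puts the host in the last slot instead.
user_r, sep_r, domain_r = anonymous.rpartition("@")
assert (user_r, sep_r, domain_r) == ("", "", "example.com")
```

Either way, the caller still has to check whether `sep` is empty before trusting the other two fields.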
Examples where the order of separators doesn't matter? In most of the examples I need, swapping order is a parse error.
> You call it "almost the same result" but it is nothing like the result from partition. The separators are lost,
Trivial to fix, just add parens, in the simpler grouping form as a bonus! I'm not asking you to like the resulting regexp better, just pointing out that your dislike of regex is driving the discussion in unprofitable directions.
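One possible reading of "just add parens" (an assumption, since the message does not spell out the pattern) is to wrap the separator class in a capturing group, which makes re.split keep the separator it split on:

```python
import re

# A capturing group around the separator class makes re.split return
# the matched separator as its own list element.
parts = re.split(r"([:;])", "foo:bar;baz", maxsplit=1)
print(parts)  # ['foo', ':', 'bar;baz']

# Unlike str.partition, a string with no separator yields a
# one-element list rather than a 3-tuple.
no_sep = re.split(r"([:;])", "foobar", maxsplit=1)
print(no_sep)  # ['foobar']
```

The different shape of the no-separator result is one place where this is still not a drop-in replacement for partition.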
> and it splits the string all at once instead of one split per call.
So does the original proposal, that's part of the point of it, I think.

I really don't see any of the variations on the proposal as a particularly valuable addition. It's already easy to screw up your parse with str.partition (the authority example: although you can fix the order problem with '@' by using str.rpartition, the multiple optional ':' mean that whichever r?partition you use, you can get it wrong unless you check the order of '@' and ':', so you have to use a recursive parse, not a sequential parse). But you can write a regex version of authority to give a sequence of tokens rather than a parse, and you convert that into a parse by checking each element of the sequence for None in a deterministic order. I prefer the latter approach (Emacs user since Emacs was programmed in TECO), but as long as you allow me to use regex for character classes and sequences, I can live with restrictions on use of regex in the style guide.

Parsing is hard. Both regex and r?partition are best used as low-level tools for tokenizing, and you're asking for trouble if you try to use them for parsing past a certain point. My breaking point for regex is somewhere around the authority example, but I wouldn't push back if my project's style guide said to break that up.

I *would* however often prefer regexp to r?partition because it would allow character classes, and in most of the areas I work with (mail, URIs, encodings) being able to detect lexical errors by using character classes is helpful. And I would prefer "one bite per call" partition to a partition at multiple points. Where I'm being pretty fuzzy, the .split methods are fine.

--
Yet another Steve
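A rough sketch of the "regex version of authority" idea described above: one pattern yields a fixed sequence of tokens, and absent parts come back as None so they can be checked in a fixed order. The grammar here is deliberately simplified (no IPv6 literals, no percent-encoding checks), and the pattern, group names and function name are assumptions of this example, not anything posted in the thread.

```python
import re

# Simplified sketch: [user[:password]@]host[:port].
# Assumptions: no IPv6 literals, no percent-encoding checks; the group
# names are invented for this example.
AUTHORITY = re.compile(
    r"(?:(?P<user>[^:@]*)(?::(?P<password>[^@]*))?@)?"
    r"(?P<host>[^:@]*)"
    r"(?::(?P<port>[0-9]*))?"
)


def tokenize_authority(text):
    """Return (user, password, host, port); absent parts are None."""
    m = AUTHORITY.fullmatch(text)
    if m is None:
        # Character classes let lexical errors surface here.
        raise ValueError(f"not a valid authority: {text!r}")
    return m.group("user", "password", "host", "port")


print(tokenize_authority("example.com"))
print(tokenize_authority("alice:secret@example.com:8080"))
```

The caller then turns the tuple into a parse by testing each element for None, rather than by comparing the positions of '@' and ':'.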
On Sun, Jan 08, 2023 at 05:30:30PM +0900, Stephen J. Turnbull wrote:
I like regexes plenty, for what they are good for. But my *liking* them or not is irrelevant as to whether this example is "well-known" or not. I'm not the heaviest regex user in the world, but I've used my share, and I've never seen this particular line noise before. (Hey, I like Forth. Sometimes line noise is great.) I mean, if all you are doing is splitting the source by some separators regardless of order, surely this does the same job and is *vastly* more obvious?
>>> re.split(r'[:;]', 'foo:bar;baz')
['foo', 'bar', 'baz']
If the order matters:
>>> re.match('(.*):(.*);(.*)', 'foo:bar;baz').groups()
('foo', 'bar', 'baz')
Or use non-greedy wildcards if you need them:
>>> re.match('(.*?):(.*?);(.*)', 'foo:b:ar;ba;z').groups()
('foo', 'b:ar', 'ba;z')
Great. Then for *those* structured examples you can happily write your regex and put the separators in the order you expect. But I'm talking about *unstructured* examples where you don't know the order of the separators, you want to split on whichever one comes first regardless of the order, and you need to know which separator that was. [...]
> Examples where the order of separators doesn't matter? In most of the examples I need, swapping order is a parse error.
Okay, then you *mostly* don't need this.
str.partition does *one* three way split, into (head, sep, tail). If you want to continue to partition the tail, you have to call it again. To me, that fixed "one bite per call" design is fundamental to partition(). If we wanted an arbitrary number of splits we'd use, um, split() :-) Of course we can debate the pros and cons of each, that's what this thread is for.
Right! I agree! And that is why I want partition to accept multiple separators and split on the first one found. I find myself needing to do that, well, not "all the time" by any means, but often enough that it's an itch I want scratched.
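For readers who want to try that behaviour today, here is a minimal sketch of a "split on whichever separator comes first" helper. `partition_any` is a hypothetical name, not an existing str method, and the tie-breaking rule is an arbitrary choice made for this sketch.

```python
def partition_any(text, *separators):
    """Split `text` at the earliest occurrence of any separator.

    Hypothetical helper, not an existing str method. Returns
    (head, sep, tail), or (text, '', '') if no separator occurs,
    mirroring str.partition.
    """
    if not separators or any(not sep for sep in separators):
        raise ValueError("need at least one non-empty separator")
    best_index, best_sep = -1, ""
    for sep in separators:
        index = text.find(sep)
        if index == -1:
            continue
        # Earliest position wins; on a tie, prefer the longer separator
        # (an arbitrary choice made for this sketch).
        if (best_index == -1 or index < best_index
                or (index == best_index and len(sep) > len(best_sep))):
            best_index, best_sep = index, sep
    if best_index == -1:
        return text, "", ""
    return text[:best_index], best_sep, text[best_index + len(best_sep):]


print(partition_any("foo;bar:baz", ":", ";"))  # ('foo', ';', 'bar:baz')
print(partition_any("foo:bar;baz", ":", ";"))  # ('foo', ':', 'bar;baz')
print(partition_any("foobar", ":", ";"))       # ('foobar', '', '')
```

It keeps the "one bite per call" shape: to continue, call it again on the tail.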
> My breaking point for regex is somewhere around the authority example,
Heh, I've written much more complicated examples. It was kinda fun, until I came back to it a month later and couldn't understand what the hell it did! :-)
I'm not sure I quite understand you there, but if I do, I would prefer to split the string and then validate the head and tail afterwards, rather than just have the regex fail.
I think we agree here.

--
Steve
Steven D'Aprano writes:
"Obvious" yes, but it's also easy to invest that call with semantics (e.g., "just three segments because that's the allowed syntax") that it doesn't possess. You haven't stated how many elements it should be split into, nor whether the separator characters are permitted in components, nor whether this component is the whole input and this regexp defines the whole syntax. The point of the "well-known idiom" is to specify most of that (and it doesn't take much more to specify all of it; specifying "no separators in components" is the most space-consuming part of the expression!). Your other alternatives have the same potential issues.
That's easy enough to do with a (relatively unknown to some ;-) regular expression:

    re.match("([^;:]*)([;:])(.*)", source)

The question is whether the need is frequent enough, and whether that's hard enough to understand / ugly enough, to warrant another method or an incompatible extension to str.partition (and str.rpartition).[1]
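As a sketch of how that pattern could be wrapped for general use: the version below builds the character class with re.escape (so a metacharacter such as '.' is safe) and falls back to a partition-style tuple when nothing matches. The function name and the fallback behaviour are assumptions of this example, and it handles single-character separators only.

```python
import re


def regex_partition(source, separators):
    """Regex-based partition on single-character separators (sketch).

    `separators` is a string of candidate characters. The function name
    and the fallback are this example's assumptions.
    """
    if not separators:
        raise ValueError("need at least one separator character")
    # re.escape keeps metacharacters such as '.', ']' or '^' literal
    # inside the character class.
    char_class = "".join(re.escape(ch) for ch in separators)
    pattern = f"([^{char_class}]*)([{char_class}])(.*)"
    # DOTALL lets the tail contain newlines, as str.partition would.
    m = re.match(pattern, source, flags=re.DOTALL)
    if m is None:
        # No separator present: mimic str.partition's return value.
        return source, "", ""
    return m.groups()


print(regex_partition("foo;bar:baz", ";:"))  # ('foo', ';', 'bar:baz')
print(regex_partition("a.b:c", ".:"))        # ('a', '.', 'b:c')
print(regex_partition("foobar", ";:"))       # ('foobar', '', '')
```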
I already knew that. Without real examples, I can't judge whether I'm pro-status quo or pro-serving-the-nonuniversal-but-still-useful-case.
I'm much more favorable to proposals where str.partition and str.rpartition split at *one* point, but the OP seemed intended to do more work (but not arbitrary amounts!) per call.
For me, often that depends on how hard I'm willing to work to support users. If the only user is myself, that's very often zero. In the case of the "well-known idiom", the only ways the regexp can fail involve wrong number of separators. I'd be willing to impose that burden on users with a "wrong number of separators" message. Another case is where I want an efficient parser for the vast majority of conformant cases and am willing to do redundant work for the error cases.

Footnotes:
[1] Here "incompatible" means that people writing code that must support previous versions of Python can't use it.
On Sun, 8 Jan 2023 at 08:32, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
+1 (while also recognising the caveats you mention subsequently)
Trying to avoid the usual discussions about permissive parsing / supporting various implementations in-the-wild: long-term, the least ambiguous and most computationally-efficient environment would probably want to reduce special cases like that? (both in-data and in-code)
> user, _, domain = "example.com".partition('@')
> does the wrong thing!
Yep - it's important to choose partition arguments (I'm mostly-resisting the temptation to call them a 'pattern') that are appropriate for the input. Structural pattern matching _seems_ like it could correspond here, in terms of selecting appropriate arguments -- but it is, as I understand it, limited to at-most-one wildcard pattern per match (by sensible design).
> I would prefer "one bite per call" partition to a partition at multiple points.
That does seem clearer - and clearer is, generally, probably better. I suppose an analysis (that I don't have the ability to perform easily) could be to determine how many regular expression codesites could be migrated compatibly and beneficially by using multiple-partition-arguments.
James Addison via Python-ideas writes:
On Sun, 8 Jan 2023 at 08:32, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
That's not very human-friendly, though. Push that to extremes and you get XML. "Nobody expects the XML Validators!"
If I understand what you mean by "structural pattern matching", that seems more appropriate to parsing already tokenized input.
My guess is that for re.match (or re.search) it would be relatively few. People tend to reach for regular expression matching when they have repetition or alternatives that they want to capture in a single expression, and that is generally not going to be easy to capture with str.partition. But I bet that *many* calls to re.split take regular expressions of the form f'[{separators}]' which would be easy enough to search for. That's where you could reduce the number of regexps.

Steve
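A rough sketch of how such a search might be done with the standard ast module. It only finds calls written literally as re.split("[...]", ...) with a plain string literal; it will miss precompiled patterns, `from re import split`, f-strings and classes containing an escaped ']'. The function name and the sample source are made up for this example.

```python
import ast
import re

# A pattern that is nothing but one non-negated character class, e.g. '[:;]'.
SIMPLE_CLASS = re.compile(r"\[[^\]\^][^\]]*\]")


def find_simple_class_splits(source):
    """Yield (line_number, pattern) for re.split calls whose first
    argument is a string literal consisting of one character class."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_re_split = (
            isinstance(func, ast.Attribute)
            and func.attr == "split"
            and isinstance(func.value, ast.Name)
            and func.value.id == "re"
        )
        if not is_re_split or not node.args:
            continue
        first = node.args[0]
        if (isinstance(first, ast.Constant) and isinstance(first.value, str)
                and SIMPLE_CLASS.fullmatch(first.value)):
            yield node.lineno, first.value


# Made-up sample source; it is only parsed, never executed.
sample = '''
import re
a = re.split(r"[:;]", text)
b = re.split(r"\\s+", text)
c = text.split(",")
'''
print(list(find_simple_class_splits(sample)))  # [(3, '[:;]')]
```

Running something like this over a large codebase would give a first estimate of how many re.split call sites use nothing more than a set of single-character separators.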
participants (4)
- James Addison
- Peter Ludemann
- Stephen J. Turnbull
- Steven D'Aprano