[Python-ideas] Re: Multiple arguments to str.partition and bytes.partition

Jan. 8, 2023

      Steven D'Aprano writes:
...
On Sat, Jan 07, 2023 at 10:48:48AM -0800, Peter Ludemann wrote:
...
You can get almost the same result using pattern matching. For example, your
"foo:bar;baz".partition(":", ";")
can be done by a well-known matching idiom:
re.match(r'([^:]*):([^;]*);(.*)', 'foo:bar;baz').groups()
"Well-known" he says :-)
It *is* well-known to those who know.  Just because you don't like
regex doesn't mean it's not well-known.  I wouldn't use that idiom
though; I'd use an explicit character class in most cases I encounter.
...
I think that the regex solution is also wrong because it requires you 
to know *exactly* what order the separators are found in the source 
string.
But that's characteristic of many examples.  In "structured" mail
headers like Content-Type, you want the separators to come in the
order ':', '=', ';'.  In a URI scheme with an authority component, you
want them in the order '@', ':'.  Except that you don't, in both those
examples.  In Content-Type, the '=' is optional, and there may be
multiple ';'.  In authority, the existing ':' is optional, and there's
an optional ':' to separate password from username before the '@'.

And it gets worse: in the authority case, the username is optional.
In the common case of anonymous access, the username is omitted, so

user, _, domain = "example.com".partition('@')

does the wrong thing!
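
To make that concrete, here is a minimal sketch using only str.partition and str.rpartition. The variable names are just for illustration:

```python
# With no '@' in the string, partition puts everything in the FIRST slot,
# so the host name ends up in `user` and `domain` is empty.
user, sep, domain = "example.com".partition("@")
# user == "example.com", sep == "", domain == ""

# rpartition puts everything in the LAST slot instead, which matches the
# "username is optional" reading of an authority.
user_r, sep_r, domain_r = "example.com".rpartition("@")
# user_r == "", sep_r == "", domain_r == "example.com"
```

This only addresses the '@' ordering; the optional ':' separators are a separate problem.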
...
If we swap the semi-colon and the colon in the source, but not 
the pattern, the idiom fails:
>>> re.match(r'([^:]*):([^;]*);(.*)', 'foo;bar:baz').groups()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'NoneType' object has no attribute 'groups'
So that makes it useless for the case where you want to split on any of 
a number of separators, but don't know which order they occur in.
Examples where the order of separators doesn't matter?  In most of the
examples I need, swapping order is a parse error.
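
If out-of-order separators really are a parse error, the failed match can be reported as one instead of surfacing as an AttributeError. A rough sketch, where parse_fields is a hypothetical helper name and the pattern is the one quoted above:

```python
import re

# The pattern from the quoted example: field, ':', field, ';', rest.
PATTERN = re.compile(r'([^:]*):([^;]*);(.*)')

def parse_fields(text):
    """Return the three fields, or raise ValueError on a bad layout."""
    m = PATTERN.match(text)
    if m is None:
        # re.match returns None when the separators are missing or swapped.
        raise ValueError(f"separators missing or out of order: {text!r}")
    return m.groups()
```

With this, parse_fields('foo:bar;baz') returns the three fields and parse_fields('foo;bar:baz') raises ValueError.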
...
You call it "almost the same result" but it is nothing like the result 
from partition. The separators are lost,
Trivial to fix, just add parens, in the simpler grouping form as a
bonus!  I'm not asking you to like the resulting regexp better, just
pointing out that your dislike of regex is driving the discussion in
unprofitable directions.
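
For instance, assuming the same pattern as quoted earlier, wrapping each separator in its own group keeps it in the result:

```python
import re

# Each separator gets its own capturing group, so nothing is lost.
m = re.match(r'([^:]*)(:)([^;]*)(;)(.*)', 'foo:bar;baz')
parts = m.groups()
# parts == ('foo', ':', 'bar', ';', 'baz')
```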
...
and it splits the string all at once instead of one split per call.
So does the original proposal, that's part of the point of it, I
think.

I really don't see any of the variations on the proposal as a
particularly valuable addition.  It's already easy to screw up your
parse with str.partition (the authority example: although you can fix
the order problem with '@' by using str.rpartition, the multiple
optional ':' mean that whichever r?partition you use, you can get it
wrong unless you check the order of '@' and ':', so you have to use a
recursive parse, not a sequential parse).  But you can write a regex
version of authority to give a sequence of tokens rather than a parse,
and you convert that into a parse by checking each element of the
sequence for None in a deterministic order.  I prefer the latter
approach (Emacs user since Emacs was programmed in TECO), but as long
as you allow me to use regex for character classes and sequences, I
can live with restrictions on use of regex in the style guide.
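
A sketch of what I mean by "a sequence of tokens", under loud assumptions: the pattern below is a simplified [user[:password]@]host[:port] that I made up for illustration (it ignores IPv6 literals and percent-encoding), and tokenize_authority is a hypothetical name, not anything in the stdlib.

```python
import re

# Simplified, illustrative pattern: [user[:password]@]host[:port]
AUTHORITY = re.compile(
    r'(?:(?P<user>[^:@]*)(?::(?P<password>[^@]*))?@)?'  # optional userinfo
    r'(?P<host>[^:@]*)'                                  # host
    r'(?::(?P<port>[0-9]*))?'                            # optional port
)

def tokenize_authority(text):
    """Return (user, password, host, port); absent parts are None."""
    m = AUTHORITY.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid authority: {text!r}")
    return m.group('user', 'password', 'host', 'port')

# Turning the token sequence into a parse: check for None in a fixed order.
user, password, host, port = tokenize_authority("example.com:8080")
anonymous = user is None       # no userinfo at all
default_port = port is None    # no explicit port given
```

Note that "example.com:8080" comes back with host "example.com" and port "8080", not with "example.com" misread as a username, which is the mistake a sequential r?partition on ':' invites.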

Parsing is hard.  Both regex and r?partition are best used as low-
level tools for tokenizing, and you're asking for trouble if you try
to use them for parsing past a certain point.  My breaking point for
regex is somewhere around the authority example, but I wouldn't push
back if my project's style guide said to break that up.  I *would*
however often prefer regexp to r?partition because it would allow
character classes, and in most of the areas I work with (mail, URIs,
encodings) being able to detect lexical errors by using character
classes is helpful.  And I would prefer "one bite per call" partition
to a partition at multiple points.  Where I'm being pretty fuzzy, the
.split methods are fine.
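
As an illustration of the character-class point, here is a small sketch for the scheme part of a URI. The class follows the RFC 3986 grammar for scheme (a letter, then letters, digits, '+', '-' or '.'); split_scheme is a hypothetical helper name.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ':' and the rest.
SCHEME = re.compile(r'([A-Za-z][A-Za-z0-9+.-]*):(.*)', re.DOTALL)

def split_scheme(uri):
    """Return (scheme, rest), rejecting lexically invalid schemes."""
    m = SCHEME.fullmatch(uri)
    if m is None:
        raise ValueError(f"no valid scheme in {uri!r}")
    return m.groups()

# partition accepts anything before the first ':', including a space.
loose_scheme, _, loose_rest = "ht tp://example.com".partition(":")
# loose_scheme == "ht tp"
```

Here split_scheme("ht tp://example.com") raises ValueError, while partition silently hands back "ht tp" and leaves the check to later code.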

-- Yet another Steve

Stephen J. Turnbull