Re: [Python-Dev] Proof of the pudding: str.partition()
At 04:55 AM 8/31/2005 -0700, Michael Chermside wrote:
Raymond's original definition for partition() did NOT support any of the following:
(*) Regular Expressions
This can be orthogonally added to the 're' module, and definitely should not be part of the string method.
(*) Ways to generate just 1 or 2 of the 3 values if some are not going to be used
Yep, subscripting and slicing are more than adequate to handle *all* of those use cases, even the ones that some people have been jumping through odd hoops to express:

    before = x.partition(sep)[0]
    found = x.partition(sep)[1]
    after = x.partition(sep)[2]

    before, found = x.partition("foo")[:2]
    found, after = x.partition("foo")[1:]
    before, after = x.partition("foo")[::2]

Okay, that last one is maybe a little too clever. I'd personally just use '__' or 'DONTCARE' or something like that for the value(s) I didn't care about, because it actually takes slightly less time to unpack a 3-tuple into three function-local variables than it does to pull out a single element of the tuple, and it's almost twice as fast as taking a slice and unpacking it into two variables. So, using three variables is both faster *and* easier to read than any of the variations anybody has proposed, including the ones I just showed above.
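Phillip's relative-speed claims can be checked with timeit; a minimal sketch (illustrative only -- exact numbers vary by machine and Python version):

```python
# A rough check of the relative-speed claims above with timeit
# (illustrative only; exact numbers vary by machine and Python version).
import timeit

setup = "x = 'spam:eggs'"
for label, stmt in [
    ("unpack all three ", "before, found, after = x.partition(':')"),
    ("index one element", "before = x.partition(':')[0]"),
    ("slice then unpack", "before, found = x.partition(':')[:2]"),
]:
    t = timeit.timeit(stmt, setup=setup, number=20_000)
    print(f"{label}: {t:.4f}s")
```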
(*) Clever use of indices to avoid copying strings
(*) Behind-the-scenes tricks to allow repeated re-partitioning to be faster than O(n^2)
Yep, -1 on these.
The absence of these "features" is a GOOD thing. It makes the behavior of partition() so simple and straightforward that it is easily documented and can be instantly understood by a competent programmer. I *like* keeping it simple. In fact, if anything, I'd give UP the one fancy feature he chose to include:
(*) An rpartition() function that searches from the right
...except that I understand why he included it and am convinced by the arguments (use cases can be demonstrated and people would expect it to be there and complain if it weren't).
I'd definitely like to keep rpartition. For example, splitting an HTTP URL's hostname from its port should be done with rpartition, since you can have a 'username:password@' part before the host, and because the host can be a bracketed IPv6 address with colons in it.
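The host/port case above can be sketched in code; `split_host_port` is a hypothetical helper for illustration, not a stdlib function:

```python
# Sketch of why rpartition (not partition) is the right tool for
# splitting host:port: the host itself may contain ':' characters.
# split_host_port is a hypothetical helper, not part of any library.
def split_host_port(netloc):
    # Strip any 'user:password@' prefix first, taking the last '@'
    # as the delimiter (also an rpartition job).
    _, _, hostport = netloc.rpartition('@')
    host, sep, port = hostport.rpartition(':')
    if not sep or ']' in port:
        # No colon at all, or the last ':' sits inside a bracketed
        # IPv6 literal -- so there is no port.
        return hostport, None
    return host, port

assert split_host_port('example.com:80') == ('example.com', '80')
assert split_host_port('[::1]:8080') == ('[::1]', '8080')
assert split_host_port('[::1]') == ('[::1]', None)
assert split_host_port('user:pw@example.com:443') == ('example.com', '443')
```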
Simplicity and elegance are two of the reasons that this is such an excellent proposal,
+1.
Phillip J. Eby wrote:
Yep, subscripting and slicing are more than adequate to handle *all* of those use cases, even the ones that some people have been jumping through odd hoops to express:
    before = x.partition(sep)[0]
    found = x.partition(sep)[1]
    after = x.partition(sep)[2]
    before, found = x.partition("foo")[:2]
    found, after = x.partition("foo")[1:]
    before, after = x.partition("foo")[::2]
Okay, that last one is maybe a little too clever. I'd personally just use '__' or 'DONTCARE' or something like that for the value(s) I didn't care about, because it actually takes slightly less time to unpack a 3-tuple into three function-local variables than it does to pull out a single element of the tuple, and it's almost twice as fast as taking a slice and unpacking it into two variables.
you're completely missing the point. the problem isn't the time it takes to unpack the return value, the problem is that it takes time to create the substrings that you don't need. for some use cases, a naive partition-based solution is going to be a lot slower than the old find+slice approach, no matter how you slice, index, or unpack the return value.
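Fredrik's objection can be made concrete: extracting only the text *before* a separator from a large string copies one substring with find+slice, while partition also materializes the unused tail.

```python
# Fredrik's point, sketched: to get just the text before the separator,
# find+slice copies only that piece, while partition also builds the
# (possibly huge) tail substring that is immediately discarded.
s = "key:" + "v" * 1_000_000

# find+slice: exactly one substring is created
i = s.find(':')
before = s[:i] if i >= 0 else s

# partition: the million-character tail is also materialized
before2 = s.partition(':')[0]

assert before == before2 == "key"
```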
So, using three variables is both faster *and* easier to read than any of the variations anybody has proposed, including the ones I just showed above.
try again. </F>
Fredrik Lundh wrote:
Phillip J. Eby wrote:
Yep, subscripting and slicing are more than adequate to handle *all* of those use cases, even the ones that some people have been jumping through odd hoops to express:
    before = x.partition(sep)[0]
    found = x.partition(sep)[1]
    after = x.partition(sep)[2]
    before, found = x.partition("foo")[:2]
    found, after = x.partition("foo")[1:]
    before, after = x.partition("foo")[::2]
Okay, that last one is maybe a little too clever. I'd personally just use '__' or 'DONTCARE' or something like that for the value(s) I didn't care about, because it actually takes slightly less time to unpack a 3-tuple into three function-local variables than it does to pull out a single element of the tuple, and it's almost twice as fast as taking a slice and unpacking it into two variables.
you're completely missing the point.
the problem isn't the time it takes to unpack the return value, the problem is that it takes time to create the substrings that you don't need.
Indeed, and therefore the performance of rpartition is likely to get worse as the length of the input string increases. I don't like to think about all those strings being created just to be garbage-collected. Pity the poor CPU ... :-)
for some use cases, a naive partition-based solution is going to be a lot slower than the old find+slice approach, no matter how you slice, index, or unpack the return value.
Yup. Then it gets down to statistical arguments about the distribution of use cases and input lengths. If we had a type that represented a substring of an existing string it might avoid the stress, but I'm not sure I see that one flying.
So, using three variables is both faster *and* easier to read than any of the variations anybody has proposed, including the ones I just showed above.
try again.
The collective brainpower that's been exercised on this one enhancement already must be phenomenal, but the proposal still isn't perfect.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Steve Holden
Fredrik Lundh wrote:
the problem isn't the time it takes to unpack the return value, the problem is that it takes time to create the substrings that you don't need.
Indeed, and therefore the performance of rpartition is likely to get worse as the length of the input string increases. I don't like to think about all those strings being created just to be garbage-collected. Pity the poor CPU ... :-)
for some use cases, a naive partition-based solution is going to be a lot slower than the old find+slice approach, no matter how you slice, index, or unpack the return value.
Yup. Then it gets down to statistical arguments about the distribution of use cases and input lengths. If we had a type that represented a substring of an existing string it might avoid the stress, but I'm not sure I see that one flying.
What about buffer()? Tack on some string methods and you get a string slice-like instance with very low memory requirements. Add on actual comparisons of buffers and strings, and you can get nearly everything desired with very low memory overhead.

A bit of free thought brings me to the (half-baked) idea that if string methods accepted any object which conformed to the buffer interface, then mmap, buffer, array, ... instances could gain all of the really convenient methods that make strings the objects to use in many cases. If one wanted to keep string methods returning strings, and other objects with the buffer protocol which use string methods returning buffer objects, that seems reasonable (and probably a good idea).

- Josiah

P.S. Pardon me if the idea is pure insanity; I haven't been getting much sleep lately, and just got up from a nap that seems to have clouded my judgement (I just put milk in my juice...).
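For later readers: Python 2's buffer() is gone, but memoryview plays the zero-copy role Josiah describes for bytes-like objects (without, as discussed below, growing string methods). A small sketch:

```python
# Josiah's buffer() idea survives in modern Python as memoryview,
# which gives zero-copy slices of any bytes-like object.
data = bytearray(b"GET /index.html HTTP/1.1")
view = memoryview(data)

path = view[4:15]                 # a zero-copy slice; no bytes duplicated
assert bytes(path) == b"/index.html"

# Direct comparison against bytes works, as Josiah hoped:
assert path == b"/index.html"
```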
Josiah Carlson wrote:
A bit of free thought brings me to the (half-baked) idea that if string methods accepted any object which conformed to the buffer interface, then mmap, buffer, array, ... instances could gain all of the really convenient methods that make strings the objects to use in many cases.
Not a bad idea, but they couldn't literally be string methods. They'd have to be standalone functions like we used to have in the string module before it got mercilessly deprecated. :-)

Not sure what happens to this when the unicode/bytearray future arrives, though. Treating a buffer of bytes as a character string isn't going to be so straightforward then.

Greg
Greg Ewing
Josiah Carlson wrote:
A bit of free thought brings me to the (half-baked) idea that if string methods accepted any object which conformed to the buffer interface, then mmap, buffer, array, ... instances could gain all of the really convenient methods that make strings the objects to use in many cases.
Not a bad idea, but they couldn't literally be string methods. They'd have to be standalone functions like we used to have in the string module before it got mercilessly deprecated. :-)
Not sure what happens to this when the unicode/bytearray future arrives, though. Treating a buffer of bytes as a character string isn't going to be so straightforward then.
Here's my thought: one could modify string methods to check the type of the input (string, unicode, or other). That check turns on a flag for whether the method returns strings, unicodes, or buffers. One uses PyObject_AsBuffer() methods to pull the char* and length from any input offering the buffer protocol.

Now here's the fun part: one makes the methods aware of the type of the self parameter. One sets the 'split' method for the buffer object to be 'string_split', etc.

Unicode does indeed get tricky: how does one handle buffers of unicode objects? Right now, you get the raw pointer and underlying length (len(buffer(u'hello')) == 10). If there were a unicode buffer (perhaps ubuffer), that would work, but I'm not sure I really like it. I notice much of the discussion on 'string views', which to me seems like another way of saying 'buffer', and if there is a 'string view', there would necessarily need to be a 'unicode view'.

As for the bytes type, from what I understand, it should directly support buffers without issue.

- Josiah
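Greg's "standalone functions" variant of this idea can be sketched today: a split() that accepts any object supporting the buffer protocol. (`buffer_split` is a hypothetical name; this naive version copies the data once, where a real implementation would work on the raw buffer directly.)

```python
# A sketch of standalone string functions over the buffer protocol:
# buffer_split is hypothetical and accepts bytes, bytearray, array,
# mmap, ... -- anything memoryview() can wrap.
from array import array

def buffer_split(obj, sep):
    data = bytes(memoryview(obj))   # one copy; real code would avoid this
    return data.split(sep)

assert buffer_split(bytearray(b"a,b,c"), b",") == [b"a", b"b", b"c"]
assert buffer_split(array('b', b"x y"), b" ") == [b"x", b"y"]
```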
for some use cases, a naive partition-based solution is going to be a lot slower than the old find+slice approach, no matter how you slice, index, or unpack the return value.
The index+slice approach will still be available for such cases. I am sure we will see relative speed versus string size benchmarks. Terry J. Reedy
[Steve Holden]
The collective brainpower that's been exercised on this one enhancement already must be phenomenal, but the proposal still isn't perfect.
Sure it is :-) It was never intended to replace all string parsing functions, existing or contemplated. We still have str.index() so people can do low-level index tracking, optimization, or whatnot. Likewise, str.split() and regexps remain better choices for some apps. The discussion has centered around the performance cost of returning three strings when two or fewer are actually used.
From that discussion, we can immediately eliminate the center string case, as it is essentially cost-free (it is either an empty string or a reference to, not a copy of, the separator argument).
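Raymond's middle-element point, shown concretely:

```python
# The middle element of the 3-tuple is either the separator or the
# empty string -- never a freshly copied substring.
s = "name=value"
assert s.partition('=') == ('name', '=', 'value')   # found
assert s.partition('#') == ('name=value', '', '')   # not found
```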
Another case that is no cause for concern is when one of the substrings is often, but not always, empty. Consider comment stripping, for example:

    # XXX a real parser would need to skip over '#' in string literals
    for line in open('demo.py'):
        line, sep, comment = line.partition('#')
        print line

On most lines, the comment string is empty, so no time is lost copying a long substring that won't be used. On the lines with a comment, I like having it because it makes the code easier to debug/maintain (making it trivial to print, log, or store the comment string).

Similar logic applies to other cases where the presence of a substring is an all-or-nothing proposition, such as cgi scripts extracting the command string when present:

    line, cmdfound, command = line.rpartition('?')

If not found, you've wasted nothing (the command string is empty). If found, you've gotten what you were going to slice out anyway.

Also, there are plenty of use cases that only involve short strings (parsing urls, file paths, splitting name/value pairs, etc.). The cost of ignoring a component for these short inputs is small.

That leaves the case where the strings are long and parsing is repeated with the same separator. The answer here is to simply NOT use partition(). Don't write:

    while s:
        line, _, s = s.partition(sep)
        . . .

Instead, you almost always do better with:

    for line in s.split(sep):
        . . .

or with re.finditer() if memory consumption is an issue.

Remember, adding partition() doesn't take away anything else you have now (even if str.find() disappears, you still have str.index()). Also, its inclusion does not preclude more specialized methods like str.before(sep) or str.after(sep) if someone is able to prove their worth.

What str.partition() does do well is simplify code by encapsulating several variations of a common multi-step, low-level programming pattern. It should be accepted on that basis rather than being rejected because it doesn't also replace re.finditer() or str.split().
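The two looping styles Raymond contrasts, side by side (the while/partition loop re-copies the remainder on every pass, which is the quadratic behavior discussed earlier):

```python
# Repeated splitting on one separator: the while/partition loop copies
# the shrinking remainder on every iteration; str.split does not.
s = "a,b,c,d"

# The pattern to avoid for large inputs (shown for inputs without a
# trailing separator):
parts = []
rest = s
while rest:
    line, _, rest = rest.partition(',')
    parts.append(line)

# The preferred form produces the same result in one pass:
assert parts == s.split(',')
```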
Because there are so many places where partition() is a clear improvement, I'm not bothered when someone concocts a case where it isn't the tool of choice. Accept it for what it is, not what it is not.

Raymond
(*) Regular Expressions
This can be orthogonally added to the 're' module, and definitely should not be part of the string method.
Sounds right to me, and it *should* be orthogonally added to the 're' module coincidentally simultaneously with the change to the string object :-).

I have to say, it would be nice if

    "foo bar".partition(re.compile(r'\s'))

would work. That is, if the argument is an re pattern object instead of a string, it would be nice if it were understood appropriately, just for symmetry's sake. But it's hardly necessary. Presumably in the re module, there would be a function like

    re.partition(r'\s', "foo bar")

for one-shot usage, or the expression

    re.compile(r'\s').partition("foo bar")

Bill
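The re.partition Bill imagines is easy to sketch on top of re.search (hypothetical -- the re module has no such function), keeping str.partition's three-tuple contract with the match text in the middle:

```python
# Hypothetical re.partition sketch: same (before, sep, after) contract
# as str.partition, with the regex match as the middle element.
import re

def re_partition(pattern, s):
    m = re.search(pattern, s)
    if m is None:
        return s, '', ''
    return s[:m.start()], m.group(0), s[m.end():]

assert re_partition(r'\s', 'foo bar') == ('foo', ' ', 'bar')
assert re_partition(r'\d+', 'abc') == ('abc', '', '')
```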
Bill Janssen wrote:
(*) Regular Expressions
This can be orthogonally added to the 're' module, and definitely should not be part of the string method.
Sounds right to me, and it *should* be orthogonally added to the 're' module coincidentally simultaneously with the change to the string object :-).
I have to say, it would be nice if
"foo bar".partition(re.compile('\s'))
would work. That is, if the argument is an re pattern object instead of a string, it would be nice if it were understood appropriately, just for symmetry's sake. But it's hardly necessary.
And it's horrible, for none of the other string methods accept a RE. In Python, RE functionality is in the re module and nowhere else, and this is a Good Thing. There are languages which give REs too much weight by philosophy (hint, hint), but Python isn't one of them. Interestingly, Python programmers suffer less from the "help me, my RE doesn't work" problem. Reinhold -- Mail address is perfectly valid!
Reinhold Birkenfeld writes:
And it's horrible, for none of the other string methods accept a RE.
I suppose it depends on your perspective as to what exactly is horrible. I tend to think it's too bad that none of the other string methods accept appropriate RE patterns. Strings are thought of as "core", whereas RE, a relatively new part of the stdlib, isn't. But it's OK -- it just gives the system more Java-ness, where you have lots of little modules, each of which does something slightly different.
There are languages which give REs too much weight by philosophy (hint, hint), but Python isn't one of them. Interestingly, Python programmers suffer less from the "help me, my RE doesn't work" problem.
Yes, but perhaps the causative bug in those "other" languages is the confusion between string *literals* and RE *literals*, which isn't a problem in the idiom I suggested. Or perhaps if RE was more helpful in Python, Python programmers would indeed suffer from the same problem. Bill
participants (9)

- Bill Janssen
- Fredrik Lundh
- Greg Ewing
- Josiah Carlson
- Phillip J. Eby
- Raymond Hettinger
- Reinhold Birkenfeld
- Steve Holden
- Terry Reedy