Re: [Python-Dev] Proof of the pudding: str.partition()
Fredrik Lundh wrote:
the problem isn't the time it takes to unpack the return value, the problem is that it takes time to create the substrings that you don't need.
I'm actually starting to think that this may be a good use case for views of strings, i.e. rather than create 3 new strings, each "string" is a view onto the string that was partitioned. In most of the use cases I've seen, the partitioned bits are discarded almost as soon as the original string, and often the original string persists beyond the partitioned bits.

Tim Delaney
Tim> I'm actually starting to think that this may be a good use case for
Tim> views of strings i.e. rather than create 3 new strings, each
Tim> "string" is a view onto the string that was partitioned.

How would this work? One of the advantages of the current string is that the underlying data is NUL-terminated, so when passing strings to C routines no copying is required. Suppose I executed

    scheme, _, rest = "http://www.python.org/".partition(':')

As a Python programmer I'd get back what look like three strings: "http", ":", and "//www.python.org/". If each of them was a view onto part of the original string, only the last one would truly refer to a NUL-terminated sequence of characters. If I then wanted to see what scheme's value compared to, the string's comparison method would have to recognize that it wasn't truly NUL-terminated, copy it, call strncmp() or whatever underlying routine is used for string comparisons. (Maybe string comparisons are done inline. I'm sure there are some examples where the underlying C string routines are called.)

OTOH, maybe that would work. Perhaps we should try it.

Skip
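To make the example above concrete, here is a minimal sketch of what partition() returns, followed by a rough stand-in for the "view" idea. The stand-in uses memoryview slices over a bytes object, which is an assumption made purely for illustration; it is not the design proposed in the thread, and the variable names are invented here.

```python
# The example from the message above: partition() always returns a 3-tuple.
url = "http://www.python.org/"
scheme, sep, rest = url.partition(":")

# When the separator is absent, the whole string comes back first,
# followed by two empty strings.
head, sep2, tail = "no-separator".partition(":")

# A rough stand-in for "views" (an assumption, not the thread's design):
# memoryview slices share the original buffer instead of copying it.
data = b"http://www.python.org/"
view = memoryview(data)
idx = data.index(b":")
scheme_v = view[:idx]          # refers to the bytes b"http" inside data
sep_v = view[idx:idx + 1]      # refers to b":"
rest_v = view[idx + 1:]        # the only piece that reaches the end of data

print(scheme, sep, rest)
print(bytes(scheme_v), bytes(sep_v), bytes(rest_v))
```

Only rest_v ends where the original buffer ends, which is the point being raised about NUL termination: the first two pieces have no terminator of their own.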
Skip> OTOH, maybe that would work. Perhaps we should try it.

Ah, I forgot the data is part of the PyString object itself, not stored as a separate char* array. Without a char* in the object it's kind of hard to do views.

Skip
skip@pobox.com wrote:
Ah, I forgot the data is part of the PyString object itself, not stored as a separate char* array. Without a char* in the object it's kind of hard to do views.
That wouldn't be a problem if substrings were a separate subclass of basestring with their own representation. That's probably a good idea anyway, since you wouldn't want slicing to return substrings by default -- it should be something you have to explicitly ask for. Greg
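A minimal sketch of the "separate type you have to explicitly ask for" idea, written in pure Python. The names StrView and partition_view are hypothetical and exist only for this illustration; ordinary slicing is untouched and still returns a plain str.

```python
class StrView:
    """Hypothetical explicit view onto part of a str (illustrative only)."""

    __slots__ = ("base", "start", "stop")

    def __init__(self, base, start=0, stop=None):
        stop = len(base) if stop is None else stop
        if not 0 <= start <= stop <= len(base):
            raise ValueError("view bounds out of range")
        self.base, self.start, self.stop = base, start, stop

    def __len__(self):
        return self.stop - self.start

    def __str__(self):
        # The copy happens only here, when a real str is requested.
        return self.base[self.start:self.stop]

    def __eq__(self, other):
        if isinstance(other, (str, StrView)):
            return str(self) == str(other)
        return NotImplemented


def partition_view(s, sep):
    """Like str.partition(), but returns three StrView objects."""
    if not sep:
        raise ValueError("empty separator")
    i = s.find(sep)
    if i < 0:
        # Mirror str.partition(): whole string, then two empty pieces.
        return StrView(s), StrView(s, len(s)), StrView(s, len(s))
    j = i + len(sep)
    return StrView(s, 0, i), StrView(s, i, j), StrView(s, j)


url = "http://www.python.org/"
scheme, sep, rest = partition_view(url, ":")
print(str(scheme), str(sep), str(rest))
```

The comparison in this sketch still copies through str(); a real implementation would presumably compare in place, which is where the representation questions discussed in the thread come in.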
On 8/31/05, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
skip@pobox.com wrote:
Ah, I forgot the data is part of the PyString object itself, not stored as a separate char* array. Without a char* in the object it's kind of hard to do views.
That wouldn't be a problem if substrings were a separate subclass of basestring with their own representation. That's probably a good idea anyway, since you wouldn't want slicing to return substrings by default -- it should be something you have to explicitly ask for.
You all are reinventing NSString. That's the NextStep string type used by ObjC. PyObjC bridges to NSString with some difficulty. I have never used this myself, but from Donovan Preston I understand that NSString is just a base class or an interface or something like that and many different implementations / subclasses exist. Donovan has suggested that we adopt something similar for Python -- I presume in part to make his life wrapping NSString easier, but at least in part because the concept really works well in ObjC.

I'm not saying to go either way yet. I'm wary of complexifications of the string implementation based on a horrifically complex implementation in ABC that was proven to be asymptotically optimal, but unfortunately was beat every time in practical applications by something much simpler, *and* the algorithm was so complex that we couldn't get the code 100% bugfree. But that was 20 years ago.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
skip@pobox.com wrote:
If I then wanted to see what scheme's value compared to, the string's comparison method would have to recognize that it wasn't truly NUL-terminated, copy it, call strncmp() or whatever underlying routine is used for string comparisons.
Python string comparisons can't be using anything that relies on nul-termination, because Python strings can contain embedded nuls. Possibly it uses memcmp(), but that takes a length. You have a point when it comes to passing strings to other C routines, though. For those that don't have a variant which takes a maximum length, the substring type might have to keep a cached nul-terminated copy created on demand. Then the copying overhead would only be incurred if you did happen to pass a substring to such a routine. Greg
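The two points above can be sketched together: comparison that relies on a known length rather than a terminator, and a NUL-terminated copy that is built only if something asks for it. ByteView, as_c_string and copies_made are hypothetical names for this illustration, and bytes objects stand in for the 8-bit string type under discussion.

```python
class ByteView:
    """Hypothetical view onto a slice of a bytes object (illustrative only)."""

    def __init__(self, base, start, stop):
        self.base, self.start, self.stop = base, start, stop
        self._c_copy = None     # NUL-terminated copy, built lazily
        self.copies_made = 0    # instrumentation for the example

    def __eq__(self, other):
        # Length-aware comparison in the spirit of memcmp(): the slice
        # boundaries decide where the data ends, not a terminator.
        # `other` is assumed to be a bytes-like object.
        return memoryview(self.base)[self.start:self.stop] == other

    def as_c_string(self):
        """Return a NUL-terminated copy, creating it at most once."""
        if self._c_copy is None:
            self._c_copy = self.base[self.start:self.stop] + b"\0"
            self.copies_made += 1
        return self._c_copy


scheme = ByteView(b"http://www.python.org/", 0, 4)
same = scheme == b"http"            # no copy is needed for this
first = scheme.as_c_string()        # the copy is made here
second = scheme.as_c_string()       # the cached copy is reused
with_nul = ByteView(b"a\0b", 0, 3)  # embedded NULs compare fine
```

The copying cost is paid only on the as_c_string() path, which matches the "created on demand" suggestion; everything else works from the base buffer plus two offsets.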
Greg Ewing wrote:
skip@pobox.com wrote:
If I then wanted to see what scheme's value compared to, the string's comparison method would have to recognize that it wasn't truly NUL-terminated, copy it, call strncmp() or whatever underlying routine is used for string comparisons.
Python string comparisons can't be using anything that relies on nul-termination, because Python strings can contain embedded nuls. Possibly it uses memcmp(), but that takes a length.
You have a point when it comes to passing strings to other C routines, though. For those that don't have a variant which takes a maximum length, the substring type might have to keep a cached nul-terminated copy created on demand. Then the copying overhead would only be incurred if you did happen to pass a substring to such a routine.
Since Python strings *can* contain embedded NULs, doesn't that rather poo on the idea of passing pointers to their data to C functions as things stand?

regards
 Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
"Steve" == Steve Holden <steve@holdenweb.com> writes:
Steve> Since Python strings *can* contain embedded NULs, doesn't
Steve> that rather poo on the idea of passing pointers to their
Steve> data to C functions as things stand?

I think it's a "consenting adults" issue. I.e., C programmers always face the issue of "Do I dare strfry() this char[]?" I don't see what difference it makes that the C program in question is being linked with Python, or that the source of the data is a Python string. He's chosen to program in C, let him get on with it.

--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.
Steve Holden wrote:
Since Python strings *can* contain embedded NULs, doesn't that rather poo on the idea of passing pointers to their data to C functions as things stand?
If a Python function is clearly wrapping a C function, one doesn't expect to be able to pass strings with embedded NULs to it. Just because a Python string can contain embedded NULs doesn't mean it makes sense to use such strings in all circumstances.

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg.ewing@canterbury.ac.nz        +--------------------------------------+
Greg> If a Python function is clearly wrapping a C function, one doesn't
Greg> expect to be able to pass strings with embedded NULs to it.

Isn't that just floating an implementation detail up to the programmer (who may well not be POSIX- or Unix-aware)?
skip@pobox.com wrote:
Greg> If a Python function is clearly wrapping a C function, one doesn't
Greg> expect to be able to pass strings with embedded NULs to it.
Isn't that just floating an implementation detail up to the programmer (who may well not be POSIX- or Unix-aware)?
As far as I'm concerned it is, yes. Until this thread highlighted it I hadn't really considered this issue. It's a bit ugly that C extensions won't handle the full range of strings that pure Python code will, but it's a typically pragmatic Python solution, so I'm not about to start a war about it.

regards
 Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
skip@pobox.com wrote:
Greg> If a Python function is clearly wrapping a C function, one doesn't
Greg> expect to be able to pass strings with embedded NULs to it.
Isn't that just floating an implementation detail up to the programmer (who may well not be POSIX- or Unix-aware)?
so if POSIX refuses to deal with, e.g., NUL bytes in file names, Python should somehow work around that to avoid "exposing implementation details" ? </F>
Greg> If a Python function is clearly wrapping a C function, one doesn't
Greg> expect to be able to pass strings with embedded NULs to it.

Skip> Isn't that just floating an implementation detail up to the
Skip> programmer (who may well not be POSIX- or Unix-aware)?

Fredrik> so if POSIX refuses to deal with, e.g., NUL bytes in file
Fredrik> names, Python should somehow work around that to avoid
Fredrik> "exposing implementation details" ?

I don't know what the correct answer is. I suspect the right thing to do will vary depending on what C function is being wrapped. I was just making sure I understood correctly that there is a potential problem.

Skip
skip@pobox.com wrote:
Greg> If a Python function is clearly wrapping a C function, one doesn't
Greg> expect to be able to pass strings with embedded NULs to it.
Isn't that just floating an implementation detail up to the programmer (who may well not be POSIX- or Unix-aware)?
Yes, but in some cases that's unavoidable. It would be impractical to provide embedded-NUL-capable replacements for all C functions that someone might want (and flat-out impossible for some, e.g. os.open()). Greg
skip@pobox.com wrote:
As a Python programmer I'd get back what look like three strings: "http", ":", and "//www.python.org/". If each of them was a view onto part of the original string, only the last one would truly refer to a NUL-terminated sequence of characters. If I then wanted to see what scheme's value compared to, the string's comparison method would have to recognize that it wasn't truly NUL-terminated, copy it, call strncmp() or whatever underlying routine is used for string comparisons. (Maybe string comparisons are done inline. I'm sure there are some examples where the underlying C string routines are called.)
Python strings are character buffers with a known length, not null-terminated C strings. the CPython implementation guarantees that the character buffer has a trailing NULL character, but that's mostly to make it easy to pass Python strings directly to traditional C API:s. (string views are nothing new in Python. the original Unicode string implementation supported this, but that was partially removed during integration. the type still uses a separate buffer to hold the characters, though (unlike 8-bit strings that store the characters in the string object itself)) </F>
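The difference between a length-counted buffer and a NUL-terminated C string can be seen from Python itself. This sketch uses ctypes.create_string_buffer as a stand-in for "a traditional C API"; that choice is an assumption made for illustration, not something the thread refers to.

```python
import ctypes

# A Python byte string knows its own length, so an embedded NUL is
# just another byte.
s = b"ab\0cd"
length = len(s)
ordered = b"ab\0cd" < b"ab\0ce"   # comparison looks past the NUL

# ctypes copies the bytes into a C char array and appends a trailing NUL.
buf = ctypes.create_string_buffer(s)
raw = buf.raw        # every byte in the array, including the terminator
value = buf.value    # the C-string view: stops at the first NUL

print(length, ordered)
print(raw, value)
```

Here raw holds all five original bytes plus the terminator, while value stops after b"ab", which is what a C routine relying on NUL termination would see.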
Fredrik> Python strings are character buffers with a known length, not
Fredrik> null-terminated C strings. the CPython implementation
Fredrik> guarantees that the character buffer has a trailing NULL
Fredrik> character, but that's mostly to make it easy to pass Python
Fredrik> strings directly to traditional C API:s.

I'm obviously missing something that's been there all along. Since Python strings can contain NULs, why do we bother to NUL-terminate them? Clearly, any traditional C API that expects to operate on NUL-terminated strings would break with a string containing an embedded NUL.

Skip
skip@pobox.com wrote:
Fredrik> Python strings are character buffers with a known length, not
Fredrik> null-terminated C strings. the CPython implementation
Fredrik> guarantees that the character buffer has a trailing NULL
Fredrik> character, but that's mostly to make it easy to pass Python
Fredrik> strings directly to traditional C API:s.
I'm obviously missing something that's been there all along. Since Python strings can contain NULs, why do we bother to NUL-terminate them? Clearly, any traditional C API that expects to operate on NUL-terminated strings would break with a string containing an embedded NUL.
sure, but that doesn't mean that such an API would break on a string that *doesn't* contain an embedded NUL. in practice, this is the difference between the "s" and "s#" argument specifiers; the former requires a NUL-free string, the latter can handle any byte string:

    >>> f = open("myfile\0")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: file() argument 1 must be (encoded string without NULL bytes), not str
    >>> f = open("myfile")
    >>> f
    <open file 'myfile', mode 'r' at 0x0091E9A0>

</F>
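The same check can be reproduced in a script. The session above is from a Python 2 interpreter, where the failure is a TypeError; recent Python 3 releases are expected to raise ValueError for an embedded NUL instead, so this sketch accepts either. The helper name try_open and the temporary file are assumptions made for the example.

```python
import os
import tempfile


def try_open(path):
    """Return "opened" on success, or the exception class name on rejection."""
    try:
        with open(path):
            return "opened"
    except (TypeError, ValueError) as exc:
        # The embedded NUL is rejected before the name reaches the C library.
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "myfile")
    with open(good, "w") as f:
        f.write("x")
    result_good = try_open(good)         # a NUL-free name works
    result_bad = try_open(good + "\0")   # an embedded NUL is refused

print(result_good, result_bad)
```

The NUL-free name is passed through to the operating system unchanged, while the name with an embedded NUL never gets that far.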
participants (7)
- Delaney, Timothy (Tim)
- Fredrik Lundh
- Greg Ewing
- Guido van Rossum
- skip@pobox.com
- Stephen J. Turnbull
- Steve Holden