Cost-Free Slice into FromString constructors--Long
We've been talking this week about ideas for speeding up the parsing of longs coming out of files or the network. The use case is having a large string with embedded longs and parsing them to real longs. One approach would be to use a simple slice:

    long(mystring[x:y])

an expensive operation in a tight loop. The proposed solution is to add further keyword arguments to long() (such as):

    long(mystring, base=10, start=x, end=y)

The start/end would allow for negative indexes, as slices do, but otherwise simply limit the scope of the parsing. There are other solutions, using buffer-like objects and such, but this seems like a simple win for anyone parsing a lot of text. I implemented it in a branch, runar-longslice-branch, but it would need to be updated with Tim's latest improvements to long. Then you may ask, why not do it for everything else parsing from string--to which I say it should. Thoughts?

-- Runar Petursson Betur runar@betur.net -- http://betur.net
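The proposed call can be modeled in pure Python. This is illustrative only: the long_slice name and its body are invented here to show the intended semantics, while the real patch would walk the underlying buffer in C without ever building the intermediate slice.

```python
def long_slice(s, base=10, start=0, end=None):
    """Model of the proposed long(mystring, base=10, start=x, end=y).

    Negative start/end behave like slice notation; otherwise they
    simply limit the scope of the parsing.  A C implementation would
    parse s[start:end] in place instead of copying it as done here.
    """
    if end is None:
        end = len(s)
    if start < 0:
        start += len(s)
    if end < 0:
        end += len(s)
    return int(s[start:end], base)

record = "id=000123 val=004567"
assert long_slice(record, start=3, end=9) == 123   # "000123"
assert long_slice(record, start=-6) == 4567        # "004567"
```

The point of the proposal is that the C version would avoid the temporary string that the `s[start:end]` slice above creates on every call.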
On Thu, May 25, 2006 at 03:01:36PM +0000, Runar Petursson wrote:
simply limit the scope of the parsing. There are other solutions, using buffer-like objects and such, but this seems like a simple win for anyone parsing a lot of text. I implemented it in a branch
I'll expand on that a little bit, since I talked to Runar yesterday about
it. My original blurb on the #nfs IRC channel was that this would seem to
have the "camel's nose under the tent" problem, that once this is in you
probably want it in lots of other places where slices may be used. Another interface would be something like a lazy string slice: functions that understood it could operate directly on the sub-buffer in C, while oblivious functions would have to create a new slice.
Later, John showed Martin and me the Java byte buffer class, and Martin is
working on that now. This would probably provide a solution. The Java
interface allows you to push a new start and end, and the buffer then acts
like it's the string in that range. Useful for parsing without having to
create new objects in a tight loop.
Conversion functions taking a start and end are the low-hanging fruit, but in the long term I would prefer something like what I described above. I believe that Martin is expecting to have something this week to try.
Thanks,
Sean
--
A computer lets you make more mistakes faster than any invention in human
history -- with the possible exceptions of handguns and tequila.
-- Mitch Ratcliffe
Sean Reifschneider, Member of Technical Staff
On 5/25/06, Sean Reifschneider
Conversion functions taking a start and end are the low-hanging fruit, but in the long term I would prefer something like what I described above. I believe that Martin is expecting to have something this week to try.
I'm -1 on this. It adds considerable API complications for something that 99.99% of users can do without regrets by using a slice. Yes, I know regex and string searches already have this API complication. They have a much more important use case. I don't want every API that takes a string argument to also take slice arguments; then everybody ends up doing duplicate work. If you really need this for speed, I recommend you make it a custom extension. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On 5/25/06, Guido van Rossum
If you really need this for speed, I recommend you make it a custom extension.
The custom extension is a good idea, and what we do now. But now that the
long algorithm is so fast, it would be nice to expose it, at least at the C level, to use a substring. Any solution--be it buffer-based or a separate module--could not reuse the code without an end pointer. If we added the end pointer to a FromSubstring (or some such) function (leaving PyLong_FromString alone), it would be easily reused from any parsing module, assuming it didn't have any negative performance implications.
On 5/25/06, Runar Petursson
On 5/25/06, Guido van Rossum
wrote: If you really need this for speed, I recommend you make it a custom
extension.
The custom extension is a good idea, and what we do now. But now that the long algorithm is so fast, it would be nice to expose it, at least at the C level, to use a substring. Any solution--be it buffer-based or a separate module--could not reuse the code without an end pointer. If we added the end pointer to a FromSubstring (or some such) function (leaving PyLong_FromString alone), it would be easily reused from any parsing module, assuming it didn't have any negative performance implications.
This discussion about conversion of longs is very much related to the new hot buffer class I'm currently adding in the bytebuf branch, which will directly tackle the issue of I/O on a fixed buffer without intermediate string creation or memory copies. You will be able to pack and unpack structs directly into and out of a fixed memory buffer. In case you're interested, it is a clone of the Java NIO ByteBuffer class for Python.

I'm making minor modifications to the socket and struct modules to support I/O using the buffer protocol. The hot buffer class will live in a module of its own, and neither the socket module nor the struct module will depend on it. Following Fredrik's suggestion I will write up a mini-PEP for discussion of this idea and will move the code into the sandbox (and out of a branch).

Additionally, I must say, all that hot water in the blue lagoon this morning was a really hot justification for the "hot buffer" name. http://furius.ca/tmp/nfs3/html/2006-05-25.11185801.html

-- Martin Blais
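The bytebuf code itself isn't shown in the thread, so the following is only a toy Python model of the Java-NIO-style position/limit window Martin describes, not his actual class or API:

```python
class HotBuffer:
    """Toy model of a NIO-style buffer: a fixed backing store plus a
    movable [position, limit) window, so a parser can narrow its view
    to a sub-range without copying the underlying bytes.
    Illustrative only; not the real bytebuf-branch code.
    """

    def __init__(self, data):
        self._data = bytearray(data)
        self.position = 0
        self.limit = len(self._data)

    def window(self, start, end):
        # Push a new start and end; the buffer then acts like it is
        # the string in that range.
        self.position, self.limit = start, end
        return self

    def tobytes(self):
        # Materialize the current window (the one unavoidable copy).
        return bytes(self._data[self.position:self.limit])

buf = HotBuffer(b"abc12345xyz")
assert buf.window(3, 8).tobytes() == b"12345"
```

The win for tight parsing loops is that window() just moves two integers; no new objects are created until the caller actually asks for the bytes.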
On Thu, 25 May 2006 15:01:36 +0000, Runar Petursson
We've been talking this week about ideas for speeding up the parsing of longs coming out of files or the network. The use case is having a large string with embedded longs and parsing them to real longs. One approach would be to use a simple slice:

long(mystring[x:y]), an expensive operation in a tight loop. The proposed solution is to add further keyword arguments to long() (such as):

long(mystring, base=10, start=x, end=y)

The start/end would allow for negative indexes, as slices do, but otherwise simply limit the scope of the parsing. There are other solutions, using buffer-like objects and such, but this seems like a simple win for anyone parsing a lot of text. I implemented it in a branch, runar-longslice-branch, but it would need to be updated with Tim's latest improvements to long. Then you may ask, why not do it for everything else parsing from string--to which I say it should. Thoughts?
This really seems like a poor option. Why fix the problem with a hundred special cases instead of a single general solution? Hmm, one reason could be that the general solution doesn't work:

exarkun@kunai:~$ python
Python 2.4.3 (#2, Apr 27 2006, 14:43:58)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> long(buffer('1234', 0, 3))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: null byte in argument for long()
>>> long(buffer('123a', 0, 3))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: invalid literal for long(): 123a

Still, fixing that seems like a better idea. ;)

Jean-Paul
On May 25, 2006, at 3:28 PM, Jean-Paul Calderone wrote:
On Thu, 25 May 2006 15:01:36 +0000, Runar Petursson
wrote: We've been talking this week about ideas for speeding up the parsing of longs coming out of files or the network. The use case is having a large string with embedded longs and parsing them to real longs. One approach would be to use a simple slice:

long(mystring[x:y]), an expensive operation in a tight loop. The proposed solution is to add further keyword arguments to long() (such as):

long(mystring, base=10, start=x, end=y)

The start/end would allow for negative indexes, as slices do, but otherwise simply limit the scope of the parsing. There are other solutions, using buffer-like objects and such, but this seems like a simple win for anyone parsing a lot of text. I implemented it in a branch, runar-longslice-branch, but it would need to be updated with Tim's latest improvements to long. Then you may ask, why not do it for everything else parsing from string--to which I say it should. Thoughts?
This really seems like a poor option. Why fix the problem with a hundred special cases instead of a single general solution?
Hmm, one reason could be that the general solution doesn't work:
exarkun@kunai:~$ python
Python 2.4.3 (#2, Apr 27 2006, 14:43:58)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> long(buffer('1234', 0, 3))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: null byte in argument for long()
>>> long(buffer('123a', 0, 3))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: invalid literal for long(): 123a
One problem with buffer() is that it does a memcpy of the buffer. A zero-copy version of buffer (a view on some object that implements the buffer API) would be nice. -bob
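Python did eventually grow exactly the object Bob is asking for: memoryview (PEP 3118) is a zero-copy view over anything that implements the buffer protocol. A quick sketch of the aliasing behavior, in modern Python:

```python
# memoryview gives a zero-copy window into a buffer-protocol object.
data = bytearray(b"foo1234bar")
view = memoryview(data)[3:7]     # no copy: a view of the four digits
assert bytes(view) == b"1234"

# The view aliases the original storage: mutating data is visible
# through the view, proving nothing was copied.
data[3] = ord("9")
assert bytes(view) == b"9234"

# Conversion still requires materializing the bytes, since int() does
# not consume a view's offset/length directly -- the same gap in the
# parsing APIs this thread is about.
assert int(bytes(view)) == 9234
```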
Bob Ippolito
One problem with buffer() is that it does a memcpy of the buffer. A zero-copy version of buffer (a view on some object that implements the buffer API) would be nice.
Did buffers change? I seem to remember that there were segfaulting conditions when using array and buffer together, due to arrays being resizable and leaving buffers pointing at the ether... or is the memcpy part of the buffer 'fix'? If so, the changes in performance and memory use may surprise long-time users of buffer.

- Josiah

P.S. There is a possible solution to the "buffer pointing at the ether due to realloc" problem, but it would rely on a non-existing (from what I understand about memory allocation) "would a realloc of pointer foo point to the same address?" query. Of course the other alternative is to just always alloc when resizing arrays, etc., leaving a version of the old object around until all buffers pointing to that object have been freed.
[Jean-Paul Calderone]
... Hmm, one reason could be that the general solution doesn't work:
exarkun@kunai:~$ python
Python 2.4.3 (#2, Apr 27 2006, 14:43:58)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> long(buffer('1234', 0, 3))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: null byte in argument for long()
>>> long(buffer('123a', 0, 3))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: invalid literal for long(): 123a
[Bob Ippolito]
One problem with buffer() is that it does a memcpy of the buffer.
It does not, at least not in the cases above. In fact, that's essentially _why_ they fail now. Here's what actually happens (in outline) for the first example: a buffer object is created that merely points to the "1234" string object, recording the offset of 0 and the length of 3. PyLong_FromString() only sees the starting address, and-- as it always does --parses until it hits a character that doesn't make sense for the input base. That happens to be the NUL at the end of the "1234" string object's guts (Python string objects are always NUL-terminated, BTW).

PyLong_FromString() is perfectly happy, but converted "1234" rather than "123". It's PyLong_FromString()'s _caller_ that's unhappy, because PyLong_FromString() also told the caller that it stopped parsing at offset 4, and then:

    if (end != s + len) {
        PyErr_SetString(PyExc_ValueError,
                        "null byte in argument for long()");

The assumption here is bad: the caller assumes that if end != s+len, it must be the case that PyLong_FromString() stopped parsing _before_ hitting the end, and did so _because_ it hit an embedded NUL byte. Neither is true in this case, so the error message is senseless. In any case, none of this would have happened if the buffer object had done a memcpy of the 3 requested bytes and NUL-terminated it.
A zero-copy version of buffer (a view on some object that implements the buffer API) would be nice.
Above we already had that. The internal parsing APIs don't currently support the buffer's "offset & length" view of the world, so have no chance of working as hoped with any kind of buffer object now.
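Tim's account of the failure can be modeled in a few lines of Python. This is a simulation of the C control flow, not the actual CPython code, and the function names are invented for illustration:

```python
def from_string(s, base=10):
    """Model of PyLong_FromString: parse from the start of s until a
    character invalid in the base, returning (value, stop_index).
    It never sees a length, only a starting address."""
    i = 0
    while i < len(s) and s[i].isdigit():
        i += 1
    return int(s[:i] or "0", base), i

def long_of_buffer(s, length):
    """Model of the caller: it assumes that stopping before `length`
    means an embedded NUL -- wrong when the 'buffer' is a zero-copy
    slice of a longer NUL-terminated string."""
    value, stop = from_string(s)
    if stop != length:   # the bad assumption Tim describes
        raise ValueError("null byte in argument for long()")
    return value

# buffer('1234', 0, 3): the parser sees all of "1234", happily parses
# 1234, reports it stopped at offset 4 != 3, and the caller misreports
# that as an embedded NUL.
try:
    long_of_buffer("1234", 3)
except ValueError as e:
    assert "null byte" in str(e)
```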
Tim Peters wrote:
PyLong_FromString() only sees the starting address, and-- as it always does --parses until it hits a character that doesn't make sense for the input base.
This is the bug, then. long() shouldn't be using PyLong_FromString() to convert its argument, but something that takes an address and a size. (Is there a PyLong_FromStringAndSize()? If not, there should be.) -- Greg
[Tim]
PyLong_FromString() only sees the starting address, and-- as it always does --parses until it hits a character that doesn't make sense for the input base.
[Greg Ewing]
This is the bug, then. long() shouldn't be using PyLong_FromString() to convert its argument, but something that takes an address and a size. (Is there a PyLong_FromStringAndSize()? If not, there should be.)
Yes, that's what I said ;-):

    The internal parsing APIs don't currently support the buffer's
    "offset & length" view of the world, so have no chance of working
    as hoped with any kind of buffer object now.

Note that this isn't limited to long(): _no_ internal parsing routine has a "sliceish" API now, and none of the code on the way toward calling parsing routines is paying any attention to the fact that it can't work when handed a buffer object. All of that must change. At least the complex() constructor gives up at once:
>>> b = buffer("123a", 0, 3)
>>> long(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for long() with base 10: '123a'
>>> int(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123a'
>>> float(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): 123a
>>> complex(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: complex() argument must be a string or a number
Runar Petursson wrote:
We've been talking this week about ideas for speeding up the parsing of longs coming out of files or the network. The use case is having a large string with embedded longs and parsing them to real longs. One approach would be to use a simple slice:

long(mystring[x:y]), an expensive operation in a tight loop. The proposed solution is to add further keyword arguments to long() (such as):

long(mystring, base=10, start=x, end=y)

The start/end would allow for negative indexes, as slices do, but otherwise simply limit the scope of the parsing. There are other solutions, using buffer-like objects and such, but this seems like a simple win for anyone parsing a lot of text. I implemented it in a branch, runar-longslice-branch, but it would need to be updated with Tim's latest improvements to long. Then you may ask, why not do it for everything else parsing from string--to which I say it should. Thoughts?
-1

This is a somewhat specialized parsing application and should not be allowed to muck up an otherwise simple, general-purpose API. Micro-optimizations do not warrant API changes. Certainly, it should not be propagated to everything that can parse from a string. Also, you are likely to find that the cost of varargs and kwargs will offset the slicing gains.

I think the whole notion is off base. The int(mystring[x:y]) situation is only important when you're doing it many times. Often, you'll be doing other conversions as well. So, you would be better off creating a text version of the struct module that would enable you to extract and convert many elements from a long record stored as text. Alternately, you could expose a function styled after fscanf(). IOW, better to provide a general-purpose text-parsing tool than to muck up the whole language for one specialized application.

Raymond
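Raymond's "text struct" / fscanf() idea isn't spelled out further in the thread. One hedged sketch of the shape such a tool might take -- the record layout and the unpack_text helper here are invented for illustration -- uses a compiled regex to extract and convert every field of a fixed-width text record in one pass:

```python
import re

# Hypothetical fixed-width record: 6-digit id, 8-digit date, 4-digit count.
RECORD = re.compile(r"(\d{6})(\d{8})(\d{4})")

def unpack_text(record):
    """fscanf-ish helper: extract and convert all integer fields of a
    record at once, instead of calling int(record[x:y]) per field in a
    tight loop."""
    m = RECORD.match(record)
    if m is None:
        raise ValueError("malformed record: %r" % record)
    return tuple(int(g) for g in m.groups())

assert unpack_text("000042200605251234") == (42, 20060525, 1234)
```

This is the general-purpose alternative Raymond gestures at: one compiled pattern amortized over many records, rather than slice arguments added to every constructor.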
participants (10)
-
Bob Ippolito
-
Greg Ewing
-
Guido van Rossum
-
Jean-Paul Calderone
-
Josiah Carlson
-
Martin Blais
-
Raymond Hettinger
-
Runar Petursson
-
Sean Reifschneider
-
Tim Peters