TextIO seek and tell cookies

Hi all,
I recently shot myself in the foot by assuming that TextIO.tell returned integers rather than opaque cookies. Specifically, I was adding an offset to the value returned by TextIO.tell. In retrospect this doesn't make sense.
Now, I don't want to drive change simply because I failed to read the documentation carefully, but I think the current API is very easy to misuse. Most of the time TextIO.tell returns a cookie that is actually an integer, and adding an offset to it and then seeking works fine.
The only indication you get that you are misusing the API is that sometimes tell returns a cookie that, when you add an integer offset to it, will cause seek() to fail with an OverflowError.
Would it be possible to change the API to return something more opaque? E.g.: rather than converting the C cookie structure to a long, could it instead be converted to a bytes() object?
(I.e.: change textiowrapper_build_cookie to use PyBytes_FromStringAndSize rather than _PyLong_FromByteArray, and the equivalent for textiowrapper_parse_cookie.)
This would ensure the return value is never misused, and using bytes objects is probably also faster than converting to/from an integer.
Are there any downsides to this? I've made some progress developing a patch to change this functionality. Is it worth polishing and submitting?
Cheers, Ben
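To make the distinction above concrete, here is a minimal sketch (not taken from the thread; the sample data and the helper names `text_roundtrip` and `binary_skip` are made up for illustration). It treats the text-mode tell() result as a token that is only ever handed back to seek(), and does offset arithmetic only in binary mode, where tell() is a byte offset.

```python
import io

DATA = "héllo wörld\n".encode("utf-8")  # made-up sample data


def text_roundtrip(n_chars):
    """Treat the tell() result as an opaque token: only hand it back to seek()."""
    f = io.TextIOWrapper(io.BytesIO(DATA), encoding="utf-8")
    f.read(n_chars)
    cookie = f.tell()      # opaque; no arithmetic is done on this value
    rest = f.read()
    f.seek(cookie)         # the supported way to return to that point
    return rest, f.read()


def binary_skip(n_bytes):
    """In binary mode tell() is a byte offset, so adding to it is well defined."""
    f = io.BytesIO(DATA)
    f.read(2)
    pos = f.tell()
    f.seek(pos + n_bytes)  # byte arithmetic, not character arithmetic
    return f.read()
```

Note that the binary-mode offset counts bytes, so it can land in the middle of a multi-byte character; that is a separate concern from the cookie issue discussed here.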

On 2016-09-26 00:21, Ben Leslie wrote:
This would ensure the return value is never mis-used and is probably also faster using bytes objects than converting to/from an integer.
why would it be faster? It's an integer internally.
Are there any downsides to this? I've made some progress developing a patch to change this functionality. Is it worth polishing and submitting?
An alternative might be a subclass of int.

On 26 September 2016 at 10:21, MRAB <python@mrabarnett.plus.com> wrote:
On 2016-09-26 00:21, Ben Leslie wrote:
Are there any downsides to this? I've made some progress developing a patch to change this functionality. Is it worth polishing and submitting?
An alternative might be a subclass of int.
It could make sense to use a subclass of int that emitted deprecation warnings for integer arithmetic, and then eventually disallowed it entirely.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
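The int-subclass idea could look roughly like the sketch below. This is only an illustration: the class name `TellCookie` and its warning text are hypothetical, nothing like it exists in the io module, and only addition and subtraction are covered.

```python
import warnings


class TellCookie(int):
    """Hypothetical int subclass for tell() results (not part of the io module)."""

    def _warn(self, op):
        # stacklevel=3 attributes the warning to the code doing the arithmetic.
        warnings.warn(
            f"{op} on a tell() cookie is deprecated; cookies are opaque",
            DeprecationWarning,
            stacklevel=3,
        )

    def __add__(self, other):
        self._warn("addition")
        return int(self) + other

    def __radd__(self, other):
        self._warn("addition")
        return other + int(self)

    def __sub__(self, other):
        self._warn("subtraction")
        return int(self) - other
```

Because the class still derives from int, comparisons with plain integers keep working without a warning; arithmetic returns a plain int, so the warning fires once per operation rather than propagating.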

On Sun, Sep 25, 2016 at 9:05 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
It could make sense to use a subclass of int that emitted deprecation warnings for integer arithmetic, and then eventually disallowed it entirely.
Be careful though, comparing these to plain integers should probably be allowed, and we also should make sure that things like serialization via JSON or storing in an SQL database don't break. I personally think it's one of those "learn not to touch the stove" cases and there's limited value in making this API idiot proof.
--Guido van Rossum (python.org/~guido)

On 25 September 2016 at 21:18, Guido van Rossum <guido@python.org> wrote:
Be careful though, comparing these to plain integers should probably be allowed,
There's a good reason why it's "opaque" ... why would you want to make it less opaque? And I'm curious why Python didn't adopt the fgetpos/fsetpos style that makes the data structure completely opaque (fpos_t). IIRC, this was added to C when the ANSI standard was first written, to allow cross-platform compatibility in cases where ftell/fseek was difficult (or impossible) to fully implement. Maybe those reasons don't matter any more (e.g., dealing with record-oriented or keyed file systems) ...
and we also should make sure that things like serialization via JSON or storing in an SQL database don't break. I personally think it's one of those "learn not to touch the stove" cases and there's limited value in making this API idiot proof.

I think the case of JSON or SQL database is even more important though. tell/seek can return 129-bit integers (maybe even more? my maths might be off here).
The very large integers that can be returned by tell() will break serialization to JSON, and storing in a SQL database (at least for most database types).
What is the value of comparing these to plain integers? Unless you happen to know the magic encoding it isn't going to be very useful I think?
Cheers, Ben

On 26 September 2016 at 02:30, Ben Leslie <benno@benno.id.au> wrote:
The very large integers that can be returned by tell() will break serialization to JSON, and storing in a SQL database (at least for most database types).
It was pointed out in private email that technically JSON can represent very large integers even if ECMAScript itself can't. But the idea of transmitting these offsets outside of a running process is not something that I had anticipated.
It got me thinking: is there a guarantee that the opaque values returned from tell() are stable across different versions of Python? My reading of "opaque" is that it could be subject to change, but that possibly isn't the intent.
It seems that since sizeof(int) and sizeof(Py_off_t) could be different in different builds of Python, even of the same version, the opaque value returned is necessarily going to be different between builds of even the same version of Python.
It seems like it would be prudent to discourage the sharing of these opaque cookies (such as via a database or interchange formats), as you'd have to be very sure that they would be interpreted correctly in any receiving instance.
Cheers, Ben
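The size concern can be illustrated with a short sketch. It uses a made-up cookie value wider than 64 bits (not one produced by a real tell() call) and checks it against two common limits: the largest integer a JavaScript Number represents exactly, and a 64-bit signed integer column. The helper names `describe` and `store_as_text` are hypothetical; storing the value as text is one possible workaround, not something proposed in the thread.

```python
import json
import sqlite3

JS_MAX_SAFE_INTEGER = 2**53 - 1   # largest integer a JavaScript Number holds exactly
INT64_MAX = 2**63 - 1             # upper bound of a 64-bit signed integer column

big_cookie = (1 << 160) | 42      # made-up value, wider than 64 bits


def describe(cookie):
    """Report which common integer ranges a cookie value fits in."""
    return {
        "fits_js_number": cookie <= JS_MAX_SAFE_INTEGER,
        "fits_int64": cookie <= INT64_MAX,
    }


def store_as_text(conn, cookie):
    """Store the cookie as a decimal string, sidestepping integer column limits."""
    conn.execute("CREATE TABLE IF NOT EXISTS positions (cookie TEXT)")
    conn.execute("INSERT INTO positions VALUES (?)", (str(cookie),))
    row = conn.execute(
        "SELECT cookie FROM positions ORDER BY rowid DESC LIMIT 1"
    ).fetchone()
    return int(row[0])


# Python's json module writes and reads arbitrary-precision integers,
# but other JSON consumers may not.
restored = json.loads(json.dumps(big_cookie))
```

Even where the value survives the trip, this says nothing about whether another Python build would interpret the cookie the same way, which is the question raised above.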

Ben Leslie wrote:
But the idea of transmitting these offsets outside of a running process is not something that I had anticipated. It got me thinking: is there a guarantee that the opaque values returned from tell() are stable across different versions of Python?
Are they even guaranteed to work on a different file object in the same process? I.e. if you read some stuff from a file, do tell() on it, then close it, open it again and seek() with that token, are you guaranteed to end up at the same place in the file?
-- Greg
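The scenario in this question can be written down as a small experiment. The sketch below is only an illustration under stated assumptions: the file name, the sample text and the helper `reopen_and_seek` are made up, and the same encoding and newline settings are used for both opens. It shows the behaviour on whatever interpreter runs it; it is not evidence of a documented guarantee.

```python
import os
import tempfile


def reopen_and_seek(text, encoding="utf-8"):
    """Take a cookie from one file object and use it on a freshly opened one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")  # hypothetical file name
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)

        # First file object: read one line, remember the cookie, read the rest.
        with open(path, "r", encoding=encoding) as f:
            f.readline()
            cookie = f.tell()
            expected = f.read()

        # Second, independent file object: seek with the saved cookie.
        with open(path, "r", encoding=encoding) as f:
            f.seek(cookie)
            return expected, f.read()
```

Calling `reopen_and_seek("first line\nsecond líne\nthird\n")` returns the remainder as seen by each file object, so the two values can be compared directly.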

On Mon, Sep 26, 2016 at 3:43 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Are they even guaranteed to work on a different file object in the same process? I.e. if you read some stuff from a file, do tell() on it, then close it, open it again and seek() with that token, are you guaranteed to end up at the same place in the file?
Yeah, that should work. The implementation is something like a byte offset to the start of a line plus a character count, plus some misc flags. I found this implementation in the 2.6 code (the last version where it was pure Python code):
    def _pack_cookie(self, position, dec_flags=0,
                     bytes_to_feed=0, need_eof=0, chars_to_skip=0):
        # The meaning of a tell() cookie is: seek to position, set the
        # decoder flags to dec_flags, read bytes_to_feed bytes, feed them
        # into the decoder with need_eof as the EOF flag, then skip
        # chars_to_skip characters of the decoded result.  For most simple
        # decoders, tell() will often just give a byte offset in the file.
        return (position | (dec_flags<<64) | (bytes_to_feed<<128) |
                (chars_to_skip<<192) | bool(need_eof)<<256)

    def _unpack_cookie(self, bigint):
        rest, position = divmod(bigint, 1<<64)
        rest, dec_flags = divmod(rest, 1<<64)
        rest, bytes_to_feed = divmod(rest, 1<<64)
        need_eof, chars_to_skip = divmod(rest, 1<<64)
        return position, dec_flags, bytes_to_feed, need_eof, chars_to_skip
--Guido van Rossum (python.org/~guido)
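The packing scheme quoted from the 2.6 code can be tried out in isolation. The sketch below restates it as standalone functions (the names `pack_cookie`, `unpack_cookie` and `FIELD_BITS` are chosen here for illustration) so that the effect of adding an offset to a packed value is visible. It mirrors only the quoted pure-Python layout; other implementations may lay the fields out differently.

```python
FIELD_BITS = 64  # each field occupies a 64-bit slot in the quoted 2.6 layout


def pack_cookie(position, dec_flags=0, bytes_to_feed=0, need_eof=False, chars_to_skip=0):
    """Combine the fields into one integer, position in the lowest slot."""
    return (position
            | (dec_flags << FIELD_BITS)
            | (bytes_to_feed << 2 * FIELD_BITS)
            | (chars_to_skip << 3 * FIELD_BITS)
            | (bool(need_eof) << 4 * FIELD_BITS))


def unpack_cookie(cookie):
    """Split the integer back into its fields, lowest slot first."""
    rest, position = divmod(cookie, 1 << FIELD_BITS)
    rest, dec_flags = divmod(rest, 1 << FIELD_BITS)
    rest, bytes_to_feed = divmod(rest, 1 << FIELD_BITS)
    need_eof, chars_to_skip = divmod(rest, 1 << FIELD_BITS)
    return position, dec_flags, bytes_to_feed, bool(need_eof), chars_to_skip


# With every other field zero, the cookie is just the byte position.
plain = pack_cookie(10)

# With other fields set, adding a small offset changes only the position
# slot; the decoder state stored in the higher slots is carried along
# unchanged, which may or may not still be valid at the new position.
stateful = pack_cookie(100, dec_flags=1, chars_to_skip=3)
shifted = unpack_cookie(stateful + 5)
```

This is why arithmetic appears to work most of the time: in the common case the cookie equals the byte offset, and the misuse only shows up once a higher slot is non-zero.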

On Mon, Sep 26, 2016, at 05:30, Ben Leslie wrote:
I think the case of JSON or SQL database is even more important though.
tell/seek can return 129-bit integers (maybe even more? my maths might be off here).
The very large integers that can be returned by tell() will break serialization to JSON, and storing in a SQL database (at least for most database types).
What is the value of comparing these to plain integers? Unless you happen to know the magic encoding it isn't going to be very useful I think?
I assume the value is that in the circumstances in which all of the flags and other bits are zero, they can be used as offsets in precisely the way that you used them. It may also be possible that in some cases where they are not zero, doing arithmetic with them is still "safe" since the real offset is still in the low-order bits. I don't know if those circumstances are predictable enough for it to be worthwhile.
Changing it would obviously break code that does this (unless, perhaps, it were changed to be a class with arithmetic operators), the question is whether such code "deserves" to be broken.
In my own tests, even a UTF-8-sig file with DOS line endings "worked". Does anyone have information about what circumstances can reliably cause tell() to return values that are *not* simple integers? Maybe it has something to do with working with stateful encodings like iso-2022 or UTF-7? What was the situation that caused your problem?
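One way to explore this question is a small probe like the sketch below. It makes no claim about which encodings produce non-simple cookies; it only reports, for a made-up sample and a few encodings, the cookie returned after a partial read and whether anything is set above the low 64 bits (assuming, as discussed elsewhere in this thread, that the byte offset lives in the low-order bits). The helper name `probe` is hypothetical.

```python
import io

SAMPLE = "héllo\r\nwörld\r\n"  # made-up text: non-ASCII characters, DOS line endings


def probe(encoding, n_chars=3):
    """Return (cookie, bits above the low 64) after reading n_chars characters."""
    f = io.TextIOWrapper(io.BytesIO(SAMPLE.encode(encoding)), encoding=encoding)
    f.read(n_chars)
    cookie = f.tell()
    rest = f.read()
    f.seek(cookie)
    # Sanity check: seeking back with the cookie reproduces the remainder.
    assert f.read() == rest
    return cookie, cookie >> 64


if __name__ == "__main__":
    for enc in ("utf-8", "utf-8-sig", "utf-16"):
        cookie, high_bits = probe(enc)
        print(f"{enc:10} cookie={cookie} above_64_bits={high_bits}")
```

Varying `n_chars`, the sample text and the list of encodings is the obvious way to extend the probe.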

On 25 September 2016 at 17:21, MRAB <python@mrabarnett.plus.com> wrote:
On 2016-09-26 00:21, Ben Leslie wrote:
This would ensure the return value is never mis-used and is probably also faster using bytes objects than converting to/from an integer.
why would it be faster? It's an integer internally.
It isn't an integer internally though, it is a cookie:
    typedef struct {
        Py_off_t start_pos;
        int dec_flags;
        int bytes_to_feed;
        int chars_to_skip;
        char need_eof;
    } cookie_type;
The memory view of this structure is then converted to a long. Surely converting to a PyLong is more work than converting to bytes? In any case, performance really isn't the motivation here.
Cheers, Ben
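The two representations being compared can be sketched with the struct module. The layout used here is an assumption for illustration, not something stated in the thread: an 8-byte Py_off_t, 4-byte ints, a 1-byte flag, all packed back to back in little-endian order with no padding. The function names are made up.

```python
import struct

# Assumed layout: 8-byte offset, three 4-byte ints, one 1-byte flag,
# little-endian, no padding. Real builds may differ.
COOKIE_FORMAT = "<qiiiB"
COOKIE_SIZE = struct.calcsize(COOKIE_FORMAT)


def cookie_as_bytes(start_pos, dec_flags=0, bytes_to_feed=0, chars_to_skip=0, need_eof=0):
    """The bytes-object form: the packed fields, with no further conversion."""
    return struct.pack(COOKIE_FORMAT, start_pos, dec_flags,
                       bytes_to_feed, chars_to_skip, need_eof)


def cookie_as_int(buf):
    """The integer form: the same bytes read as one little-endian number."""
    return int.from_bytes(buf, "little")


def int_as_cookie_fields(value):
    """Go back from the integer form to the individual fields."""
    buf = value.to_bytes(COOKIE_SIZE, "little")
    return struct.unpack(COOKIE_FORMAT, buf)
```

Under these assumptions the cookie occupies 21 bytes, so the integer form can need up to 168 bits. In this sketch an integer that no longer fits in 21 bytes makes `to_bytes` raise OverflowError; whether that corresponds to the failure described at the start of the thread is not established here.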
participants (7)
- Ben Leslie
- Greg Ewing
- Guido van Rossum
- MRAB
- Nick Coghlan
- Peter Ludemann
- Random832