TextIOBase: Make tell() and seek() pythonic

seek() and tell() work with opaque values, called cookies. This exposes low-level details, and it is not pythonic (non-pythonic and non-portable behaviour). Currently, feeding seek() a wrong value can lead to unexpected behaviour. There should be a safer abstraction over these two basic functions. More details in the issue: https://github.com/python/cpython/issues/93101#issue-1244996658
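(A minimal sketch of the hazard being described, not from the original post: an in-memory io.BytesIO stands in for a real UTF-8 file, and seek() is fed an integer that tell() never returned, landing in the middle of a multi-byte character.)

import io

# In-memory stand-in for a real UTF-8 text file.
f = io.TextIOWrapper(io.BytesIO('aΩλz'.encode('utf-8')), encoding='utf-8')
print(f.read())   # aΩλz
f.seek(2)         # 2 was never returned by tell(): it lands inside Ω
try:
    print(f.read())
except UnicodeDecodeError as exc:
    print('seek() split a character:', exc)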

On Tue, 24 May 2022 at 23:03, <mguinhos@gmail.com> wrote:
Definitely not. Being able to rewind a text file to a known location is incredibly useful, and hiding these methods would be a net negative for the language. You haven't answered the questions on that issue, so I won't touch the part where you want magic to happen until you've explained that part. ChrisA

mguinhos@gmail.com writes:
There should be a safer abstraction to these two basic functions.
There is: TextIOBase.read, then treat it as an array of code units (NOT CHARACTERS!!)
More details in the issue:
Not at all persuasive. I'm with Chris: you need to present the abstraction you want. One thing you don't seem to understand: Python does *not* know about characters natively. str is an array of *code units*. This is much better than the pre-PEP-393 situation (where the unicode type was UTF-16, nowadays except for PEP 383 non-decodable bytes there are no surrogates to worry about), but Python doesn't care if you use NFD, and there are characters that have no composed version (some are the kind of thing you see in @jwz's display name on Twitter, but some of them are characters that exist in national standards but not in Unicode NFC form, I believe). If code points are good enough for you, you need to specify that. -- I, too, gruntle. What about it?
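(A sketch of the abstraction Stephen points at: read the stream into a str and index that by code point, rather than doing arithmetic on seek()/tell() cookies. io.StringIO stands in for a file opened in text mode.)

import io

buf = io.StringIO('aΩλz')   # stands in for open(path, encoding=...)
text = buf.read()
print(text[2])               # 'λ': a code-point index, independent of byte widths
print(text[2:])              # "seek to code point 2" without touching the stream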

IIRC, there were two builds: 16- and 32-bit Unicode. But it wasn’t UTF-16, it was UCS-2. -CHB
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On 5/26/22, Christopher Barker <pythonchb@gmail.com> wrote:
IIRC, there were two builds: 16- and 32-bit Unicode. But it wasn’t UTF-16, it was UCS-2.
In the old implementation prior to 3.3, narrow and wide builds were supported regardless of the size of wchar_t. For a narrow build, if wchar_t was 32-bit, then PyUnicode_FromWideChar() would encode non-BMP ordinals as UTF-16 surrogate pairs, and PyUnicode_AsWideChar() implemented the reverse, from UTF-16 back to UTF-32. There were several similar cases, such as PyUnicode_FromOrdinal(). The header called this "limited" UTF-16 support, primarily I suppose because the length of strings and indexing failed to account for surrogate pairs. For example:

>>> s = '\U00010000'
>>> len(s)
2
>>> s[0]
'\ud800'
>>> s[1]
'\udc00'

Here's a link to the old implementation: https://github.com/python/cpython/blob/v3.2.6/Objects/unicodeobject.c

On Wed, May 25, 2022 at 06:16:50PM +0900, Stephen J. Turnbull wrote:
No need to shout :-)

Reading the full thread on the bug tracker, I think that when Marcel (mguinhos) refers to "characters", he probably is thinking of "code points" (not code units, as you put it).

Digression into the confusing Unicode terminology, for the benefit of those who are confused... (which also includes me... I'm writing this out so I can get it clear in my own mind).

A *code point* is an integer between 0 and 0x10FFFF inclusive, each of which represents a Unicode entity. In common language, we call those entities "characters", although they don't perfectly map to characters in natural language. Most code points are as yet unused, most of the rest represent natural language characters, some represent fragments of characters, and some are explicitly designated "non-characters". (Even the Unicode consortium occasionally calls these abstract entities characters, so let's not get too uptight about mislabelling them.)

Abstract code points 0...0x10FFFF are all very well and good, but they have to be stored in memory somehow, and that's where *code units* come into it: a *code unit* is a chunk of memory, usually 8 bits, 16 bits, or 32 bits. https://unicode.org/glossary/#code_unit

The number of code units used to represent each code point depends on the encoding used:

* UCS-2 is a fixed size encoding, where 1 x 16-bit code unit represents a code point between 0 and 0xFFFF.
* UTF-16 is a variable size encoding, where 1 or 2 x 16-bit code units represent a code point between 0 and 0x10FFFF.
* UCS-4 and UTF-32 are (identical) fixed size encodings, where 1 x 32-bit code unit represents each code point.
* UTF-8 is a variable size encoding, where 1, 2, 3 or 4 x 8-bit code units represent each code point.
* UTF-7 is a variable size encoding which uses 1-8 7-bit code units. Let's not talk about that one.

That's Unicode. But TextIOBase doesn't just support Unicode, it also supports legacy encodings which don't define code points or code units. Nevertheless we can abuse the terminology and pretend that they do, e.g. most such legacy encodings use a fixed 1 x 8-bit code unit (a byte) to represent a code point (a character). Some are variable size, e.g. SHIFT-JIS. So with this mild abuse of terminology, we can pretend that all(?) those old legacy encodings are "Unicode".

TL;DR: Every character, or non-character, or bit of a character, which for the sake of brevity I will just call "character", is represented by an abstract numeric value between 0 and 0x10FFFF (the code point), which in turn is implemented by a chunk of memory between 1 and N bytes in size, for some value of N that depends on the encoding.
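(To make those code-unit counts concrete, a small sketch, not part of the original message, counting the units needed for one non-BMP code point:)

ch = '\U0001F600'                          # one code point, above the BMP
print(len(ch.encode('utf-8')))             # 4 x 8-bit code units
print(len(ch.encode('utf-16-le')) // 2)    # 2 x 16-bit code units (a surrogate pair)
print(len(ch.encode('utf-32-le')) // 4)    # 1 x 32-bit code unit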
One thing you don't seem to understand: Python does *not* know about characters natively. str is an array of *code units*.
Code points, not units. Except that even the Unicode Consortium sometimes calls them "characters" in plain English. E.g. the code point U+0041 which has numeric value 0x41 or 65 in decimal represents the character "A". (Other code points do not represent natural language characters, but if ASCII can call control characters like NULL and BEL "characters", we can do the same for code points like U+FDD0, official Unicode terminology be damned.)
Narrow builds were UCS-2; wide builds were UTF-32. The situation was complicated in that your terminal was probably UTF-16, and so a surrogate pair that Python saw as two code points may have been displayed by the terminal as a single character.
but Python doesn't care if you use NFD,
The *normalisation forms* NFD etc operate at the level of code points, not encodings. I believe you may be trying to distinguish between what Unicode calls "graphemes", which is very nearly the same as natural language characters (plus control characters, noncharacters, etc), versus plain old code points. For example, the grapheme (natural character) ü may be normalised as the single code point

    U+00FC LATIN SMALL LETTER U WITH DIAERESIS

or as a sequence of code points:

    U+0075 LATIN SMALL LETTER U
    U+0308 COMBINING DIAERESIS

I believe that dealing with graphemes is a red herring, and that is not what Marcel has in mind. -- Steve (the other one)
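(A quick demonstration of that ü example, an editorial sketch using the stdlib unicodedata module:)

import unicodedata

nfc = '\u00fc'                                    # ü as one code point
nfd = unicodedata.normalize('NFD', nfc)           # u + combining diaeresis
print(len(nfc), len(nfd))                         # 1 2
print(nfc == nfd)                                 # False, though both render as ü
print(unicodedata.normalize('NFC', nfd) == nfc)   # True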

On Thu, May 26, 2022 at 08:28:24PM +1000, Steven D'Aprano wrote:
Narrow builds were UCS-2; wide builds were UTF-32.
To be more precise, narrow builds were sort of a hybrid between an incomplete version of UTF-16 and a superset of UCS-2. Like UTF-16, if your code point was above U+FFFF, it would be represented by a pair of surrogate code points. But like UCS-2, that surrogate pair was seen as two characters rather than one. (If you think this is complicated and convoluted, yes, yes it is.) -- Steve

On Tue, May 24, 2022 at 04:31:13AM -0000, mguinhos@gmail.com wrote:
seek() and tell() work with opaque values, called cookies. This exposes low-level details, and it is not pythonic.
Even after reading the issue you linked to, I am not sure I understand either the issue, or your suggested solution. I *think* that the issue is this:

Suppose we have a text file containing four characters (to be precise: code points):

    aΩλz

namely U+0061 U+03A9 U+03BB U+007A. You would like tell() and seek() to accept indexes 0, 1, 2, 3, 4 which would move the file pointer to:

    0 moves to the start of the file, just before the a
    1 moves to just before the Ω
    2 moves to just before the λ
    3 moves to just before the z
    4 moves to after the z (EOF).

**But** in reality, the file position cookies for that file will depend on the encoding used. For UTF-8, the valid cookies are:

    0 moves to the start of the file, just before the a
    1 moves to just before the Ω
    3 moves to just before the λ
    5 moves to just before the z
    6 moves to after the z (EOF).

Other encodings may give different cookies. If you seek() to position 4, say, the results will be unpredictable but probably not anything good.

In other words, the tell() and seek() cookies represent file positions in **bytes**, even though we are reading or writing a text file. You would like the cookies to be file positions measured in **characters** (or to be precise, code points).

Am I close? -- Steve
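(Those UTF-8 cookie values can be checked directly; a sketch using an in-memory stream in place of a real file:)

import io

f = io.TextIOWrapper(io.BytesIO('aΩλz'.encode('utf-8')), encoding='utf-8')
cookies = [f.tell()]
while f.read(1):
    cookies.append(f.tell())
print(cookies)    # [0, 1, 3, 5, 6]: byte offsets, not character indexes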

On 5/26/22, Steven D'Aprano <steve@pearwood.info> wrote:
To clarify the general context, text I/O tell() and seek() cookies aren't necessarily just a byte offset. They can be packed integers that include a start position, decoder flags, a number of bytes to be fed into the decoder, whether the decode operation should be final (EOF), and the number of decoded characters (ordinals) to skip. For example:

>>> open('spam.txt', 'w', encoding='utf-7').write('\u0100'*4)
4
>>> f = open('spam.txt', encoding='utf-7')
>>> f.read(2)
'ĀĀ'
>>> f.tell()
680564734871843039612185603579607777280
>>> start_pos, dec_flags, bytes_to_feed, need_eof, chars_to_skip = (
...     _pyio.TextIOWrapper._unpack_cookie(..., f.tell()))
>>> start_pos, dec_flags, bytes_to_feed, need_eof, chars_to_skip
(0, 55834574848, 2, False, 0)

On Thu, 26 May 2022 at 22:07, Eryk Sun <eryksun@gmail.com> wrote:
If I'm reading this correctly, the result from f.tell() has enough information to reconstruct a position within a hypothetical array of code points contained within the file (that is to say - if you read the entire file into a string, f.tell() returns something that can be turned into an index into that string), but that position might not actually correspond to a single byte location. Is that it? I think UTF-7 is an awesome encoding. Really good at destroying people's expectations of what they thought they could depend on. (Terrible for actually using, though.) ChrisA

Chris Angelico writes:
That's what the OP wants. That's not what f.tell does. f.tell returns information sufficient to recreate the state of the stream I/O part of a codec when it reaches that point in the stream. Its *purpose* is to support producing the rest of the str produced by f.read() after a shorter read, but f.tell doesn't care if the str ever existed or ever will exist in the process that calls it.
Better than the alternatives for its intended use cases, which tells you a lot about the hostile environment RFC-822 created for the rest of the world. ;-)
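(To make the contract Stephen describes concrete: the only supported use is the round trip, feeding a cookie obtained from tell() back to seek() with no arithmetic in between. A sketch under that assumption, reusing Eryk's UTF-7 example with an in-memory stream:)

import io

f = io.TextIOWrapper(io.BytesIO(('\u0100' * 4).encode('utf-7')), encoding='utf-7')
f.read(2)
cookie = f.tell()      # opaque: packed decoder state, not a byte offset
rest = f.read()        # 'ĀĀ'
f.seek(cookie)         # restores both the position and the decoder state
assert f.read() == rest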

On Fri, 27 May 2022 at 12:01, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Well, the OP wants to be able to do arithmetic on it, but what actually happens is that it's nothing more than a magic cookie. But it ought to be able to recreate the state at any point, after returning any number of characters. Right? There's no way that it would ever fail to get to the same position in a character stream. It's just that there's no way to synthesize that, you have to start at the beginning. ChrisA
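(An editorial sketch of the only portable workaround Chris describes: scan once from the beginning, recording a cookie per character, then look positions up in that table afterwards:)

import io

def cookie_table(f):
    """Map character index -> tell() cookie, by scanning from the start."""
    f.seek(0)
    cookies = [f.tell()]
    while f.read(1):
        cookies.append(f.tell())
    return cookies

f = io.TextIOWrapper(io.BytesIO('aΩλz'.encode('utf-8')), encoding='utf-8')
table = cookie_table(f)
f.seek(table[2])      # "seek to character 2"
print(f.read())       # λz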
