[Python-Dev] Generalised String Coercion

"Martin v. Löwis" martin at v.loewis.de
Mon Aug 8 07:59:02 CEST 2005

Guido van Rossum wrote:
> I'm not sure if it works for all encodings, but if possible I'd like
> to extend the seeking semantics on text files: seek positions are byte
> counts, and the application should consider them as "magic cookies".

If the seek position is merely a number, it won't work for all
encodings. For the ISO 2022 ones (iso-2022-jp etc), you need to know
the shift state: you can switch to a different encoding in the stream
using standard escape codes, and then the same bytes are interpreted
differently. For example, iso-2022-jp-2 (the multilingual extension
of iso-2022-jp) supports these escape codes:

ESC ( B           ASCII
ESC $ @           JIS X 0208-1978
ESC $ B           JIS X 0208-1983
ESC ( J           JIS X 0201-Roman
ESC $ A           GB2312-1980
ESC $ ( C         KSC5601-1987
ESC $ ( D         JIS X 0212-1990
ESC . A           ISO8859-1
ESC . F           ISO8859-7

So at a certain position in the stream, the same bytes could mean
different characters, depending on which "shift state" you are in.
That's why ISO C introduced fgetpos/fsetpos in addition to
ftell/fseek: an fpos_t is a truly opaque structure that can also
incorporate codec state.
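
A minimal Python sketch of why the position needs to carry the shift
state. Cookie and resume() are hypothetical names made up for this
illustration (they are not an existing API), and the "state" is
reduced to the escape sequence currently in force, which is an
assumption that only holds for this simple case:

```python
# Sketch only: Cookie and resume() are hypothetical, not a real API.
from typing import NamedTuple

ENCODING = "iso2022_jp"
ASCII = b"\x1b(B"        # ESC ( B
JIS_X_0208 = b"\x1b$B"   # ESC $ B  (JIS X 0208-1983)


class Cookie(NamedTuple):
    offset: int   # byte offset into the stream
    shift: bytes  # escape sequence in force at that offset


def resume(data: bytes, cookie: Cookie) -> str:
    """Decode from the cookie position, re-establishing the shift state first."""
    return (cookie.shift + data[cookie.offset:]).decode(ENCODING)


# Two hiragana characters: ESC $ B, two 2-byte characters, ESC ( B.
data = "\u3042\u3044".encode(ENCODING)
offset = len(JIS_X_0208) + 2  # just past the first character

# A bare byte count loses the state: the bytes are read as ASCII.
naive = data[offset:].decode(ENCODING)

# A cookie that carries the state recovers the second character.
stateful = resume(data, Cookie(offset, JIS_X_0208))
```

Here stateful is the second character, while naive is some ASCII
text, because the decoder starts out in the ASCII state.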

If you follow this approach, you can get back most of seek;
you will lose the "whence" parameter, i.e. you cannot seek back
and forth relative to the current position, and you cannot position
at the end of the file. (Actually, iso-2022-jp still supports
appending to a file, since it requires that all data "shift out"
back to ASCII at the end of each line, and at the end of the file.
So "correct" ISO 2022 files can still be concatenated.)
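
The concatenation property can be checked with Python's iso2022_jp
codec. This is only a small sketch with made-up sample data; it
assumes the encoder returns to ASCII before each newline, which is
what the CPython codec does:

```python
ENCODING = "iso2022_jp"
ASCII = b"\x1b(B"  # ESC ( B

# Two separate "files", each a single line of one hiragana character.
first = "\u3042\n".encode(ENCODING)
second = "\u3044\n".encode(ENCODING)

# Each piece ends in the ASCII state, so the second piece can simply
# be appended and the result still decodes correctly.
ends_in_ascii = first.endswith(ASCII + b"\n")
joined = (first + second).decode(ENCODING)
```

Since both pieces start and end in the ASCII state, joined is just
the two lines one after the other.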

> Is there any reason not to do Universal Newline processing on *all*
> text files?

I see no reason not to. However, this still might result in a full
rewrite of the universal newlines code: the code currently operates
on byte streams, when it "should" operate on character streams. In
some encodings,
CRLF simply isn't represented by \x0d\x0a
(e.g. UTF-16-LE: \x0d\0\x0a\0).
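
A short Python check of the UTF-16-LE case. The sample text is made
up for illustration; the last line shows that a byte-level search
can also match where there is no newline at all:

```python
# CRLF in UTF-16-LE has a zero byte after each character.
crlf = "\r\n".encode("utf-16-le")

text = "one\r\ntwo".encode("utf-16-le")

# Searching the byte stream for \x0d\x0a misses the line ending ...
found_by_bytes = b"\r\n" in text

# ... while searching the decoded character stream finds it.
found_by_chars = "\r\n" in text.decode("utf-16-le")

# Conversely, the single code point U+0A0D encodes to the bytes
# \x0d\x0a, so a byte-level search would report a bogus CRLF.
false_hit = "\u0a0d".encode("utf-16-le")
```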

