[Python-3000] encoding hell

Guido van Rossum guido at python.org
Wed Sep 6 03:09:21 CEST 2006

On 9/4/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> On Sun, Sep 03, 2006 at 01:45:28PM -0700, Aahz wrote:
> > On Sun, Sep 03, 2006, Marcin 'Qrczak' Kowalczyk wrote:
> > > "tomer filiba" <tomerfiliba at gmail.com> writes:
> > >>
> > >> file("foo", "w+") ?
> > >
> > > What is the rationale for this operation on a text file?
> >
> > You want to be able to read the file and write data to it.  That argues
> > in favor of seek(0) and seek(-1) being the only supported behaviors,
> > though.

Umm, where he wrote seek(-1) he probably meant seek(0, 2), which is how
one seeks to EOF.
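
For concreteness, a minimal sketch of seeking to EOF, assuming a modern
Python 3 and a file opened in binary mode. The helper name
size_by_seeking_to_eof is made up for illustration; os.SEEK_END is the
named constant for whence=2.

```python
import os
import tempfile


def size_by_seeking_to_eof(path):
    """Return the size of the file in bytes by seeking to its end."""
    with open(path, "rb") as f:
        # seek(0, 2): offset 0 relative to whence=2 (os.SEEK_END), i.e. EOF.
        f.seek(0, os.SEEK_END)
        return f.tell()


if __name__ == "__main__":
    # Throwaway demo file; the name and contents are arbitrary.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello\n")
        os.close(fd)
        print(size_by_seeking_to_eof(path))
    finally:
        os.remove(path)
```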

>    Sometimes programs need tell() + seek(). Two examples (very similar,
> really).
>    Example 1. I have a program, an email robot that receives email(s) and
> marks email addresses in a "database" that is actually a text file:
> --- email database file ---
>  phd at phd.pp.ru
>  phd at oper.med.ru
> --- / ---
>    The program opens the file in "r+" mode, reads it line by line and
> stores the position of the first character of every line using tell().
> When it needs to mark an email it seek()s to the stored position and writes
> a '+' mark so the file looks like
> --- email database file ---
> +phd at phd.pp.ru
>  phd at oper.med.ru
> --- / ---

I don't understand how it can insert a character into the file without
rewriting everything after that point.
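
One possible reading of the quoted listing is that every line begins
with a one-character status column (a space), so the '+' overwrites that
character in place rather than being inserted. A rough sketch under that
assumption, in modern Python 3; the function names are hypothetical and
the file is assumed to be plain ASCII:

```python
import os
import tempfile


def index_line_starts(f):
    """Return the tell() position of the start of each line in text file f."""
    positions = []
    while True:
        pos = f.tell()       # position before the line is read
        line = f.readline()  # readline(), since tell() is disabled while iterating
        if not line:
            break
        positions.append(pos)
    return positions


def mark_line(f, pos):
    """Overwrite the one-character status column at a stored position."""
    f.seek(pos)
    f.write("+")  # replaces the leading space; the file length is unchanged
    f.flush()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b" phd at phd.pp.ru\n phd at oper.med.ru\n")
        os.close(fd)
        with open(path, "r+", encoding="ascii", newline="") as f:
            starts = index_line_starts(f)
            mark_line(f, starts[0])
        with open(path, encoding="ascii") as f:
            print(f.read(), end="")
    finally:
        os.remove(path)
```

This only works because the replacement is exactly as long as what it
replaces; a real insertion would still require rewriting the rest of the
file.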

But it does remind me of a use case for tell+seek on a read-only text
file. An email-reading program may have a text-based multi-message
mailbox format (e.g. UNIX mailbox format) and build an in-memory index
of seek positions using a quick initial scan (or scanning as it goes).
Once it has computed the position of a message it can quickly seek to
its start and display that message.
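
A minimal sketch of that indexing scheme, assuming modern Python 3 and a
simplified mailbox in which every line starting with "From " begins a
new message (real mbox parsing has more rules than this). The names
build_index and read_message are invented for the example.

```python
import os
import tempfile


def build_index(f):
    """Scan text file f once and return the tell() position of each message."""
    index = []
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line.startswith("From "):
            index.append(pos)
    return index


def read_message(f, index, n):
    """Seek straight to message n and return its text."""
    f.seek(index[n])
    lines = [f.readline()]  # the "From " separator line itself
    while True:
        line = f.readline()
        if not line or line.startswith("From "):
            break
        lines.append(line)
    return "".join(lines)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write("From a\nSubject: caf\u00e9\n\nbody one\n")
            f.write("From b\nSubject: two\n\nbody two\n")
        with open(path, encoding="utf-8") as f:
            index = build_index(f)
            print(read_message(f, index, 1), end="")
    finally:
        os.remove(path)
```

The index holds whatever tell() returned and is only ever passed back to
seek(), so it does not matter whether the values are byte offsets.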

Granted, typical mailbox formats tend to use ASCII only. But one could
easily imagine a similar use case for encoded text files containing
multiple application-specific sections.

As long as the state of the decoder is "neutral" at the start of a
line, it should be possible to do this. I like the idea that tell()
returns a "cookie" which is really a byte offset. If one wants to be
able to seek to positions with a non-neutral decoder state, the cookie
would have to be more abstract. It shouldn't matter; text apps should
not do arithmetic on seek/tell positions.
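
To illustrate treating the position as a cookie, here is a small sketch
assuming modern Python 3 text files. The helper read_twice_via_cookie is
hypothetical. The value from tell() is saved and handed back to seek()
unchanged, and nothing is added to or subtracted from it.

```python
import os
import tempfile


def read_twice_via_cookie(path, encoding="utf-8"):
    """Skip line 1, then read line 2 twice by saving and restoring a cookie."""
    with open(path, "r", encoding=encoding, newline="") as f:
        f.readline()
        cookie = f.tell()  # opaque: only ever passed back to seek()
        first = f.readline()
        f.seek(cookie)     # return to the saved position
        again = f.readline()
        return first, again


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline="") as f:
            # Line 1 holds a non-ASCII character, so character counts and
            # byte counts differ by the time line 2 starts.
            f.write("\u00e9\nsecond\nthird\n")
        print(read_twice_via_cookie(path))
    finally:
        os.remove(path)
```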

>    Example 2. INN (the NNTP daemon) stores (at least stored when I was
> using it) information about newsgroups in a text file database. It uses
> another approach - it stores info using lines of equal length:
> --- newsgroups ---
> comp.lang.python                          000001234567
> comp.lang.python.announce                 000000abcdef
> --- / ---
>    Probably INN doesn't use tell() - it just calculates the position using
> line length. But a Python program needs tell() and seek() for such a file.
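
A rough sketch of the equal-length-lines approach, assuming ASCII
records accessed in binary mode, so that positions really are byte
offsets and computing them is safe. The field widths and function names
are guesses made for illustration, not INN's actual layout.

```python
import io

NAME_WIDTH = 42   # assumed width of the newsgroup name column
VALUE_WIDTH = 12  # assumed width of the value column
RECORD_LEN = NAME_WIDTH + VALUE_WIDTH + 1  # plus the trailing newline


def format_record(name, value):
    """Build one fixed-length ASCII record."""
    if len(name) > NAME_WIDTH or len(value) > VALUE_WIDTH:
        raise ValueError("field too long for a fixed-length record")
    text = name.ljust(NAME_WIDTH) + value.rjust(VALUE_WIDTH, "0") + "\n"
    return text.encode("ascii")


def read_record(f, n):
    """Return (name, value) of record n, located by arithmetic alone."""
    f.seek(n * RECORD_LEN)
    raw = f.read(RECORD_LEN)
    name = raw[:NAME_WIDTH].decode("ascii").rstrip()
    value = raw[NAME_WIDTH:NAME_WIDTH + VALUE_WIDTH].decode("ascii")
    return name, value


def update_record(f, n, value):
    """Overwrite the value column of record n in place."""
    if len(value) > VALUE_WIDTH:
        raise ValueError("value too long for a fixed-length record")
    f.seek(n * RECORD_LEN + NAME_WIDTH)
    f.write(value.rjust(VALUE_WIDTH, "0").encode("ascii"))


if __name__ == "__main__":
    # io.BytesIO stands in for a file opened in "r+b" mode.
    db = io.BytesIO()
    db.write(format_record("comp.lang.python", "1234567"))
    db.write(format_record("comp.lang.python.announce", "abcdef"))
    update_record(db, 0, "1234568")
    print(read_record(db, 0))
    print(read_record(db, 1))
```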

--Guido van Rossum (home page: http://www.python.org/~guido/)
