writing from file into dictionary

Mon Nov 11 08:01:36 EST 2002

On Monday 11 November 2002 01:15 pm, Brad Hards wrote:
> On Mon, 11 Nov 2002 22:16, Alex Martelli wrote:
> <snip>
>
> > Assuming the trailing colon at the end of the 'rewrite' part
> > (and anything following it, such as the comment in the last
> > line) is to be removed; and that corresponding to each head
> > you want a list of the possible rewrites for that head;
>
> Can I try an explanation?

Sure!  Sorry for not providing one earlier -- people are often happier
to get just the working-fragment rather than all the spiel behind it.

> > grammardict = {}
>
> This creates a dictionary, and provides a reference to that dictionary, which
> is "grammardict". We'll need this later.

Right.  The dict object starts out empty.

> > for line in open('thefile'):
>
> open() is another name for file() in 2.2

Yes.  It's the older name, so it's the one I'm used to using;-).

> When used this way, open() or file(), takes a default of opening in read
> only mode, with unbuffered I/O.

Yes, and no.  Yes, read-only mode (and text-mode too, on platforms where
it matters); but no, unbuffered I/O is not the default -- that would damage
performance substantially in this default-opening case.
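
As a rough check of those defaults, here is a minimal sketch.  It assumes
a modern Python 3 interpreter (not the 2.2 under discussion), and the
scratch file is a made-up stand-in for 'thefile':

```python
import io
import os
import tempfile

# Hypothetical scratch file standing in for 'thefile'.
fd, path = tempfile.mkstemp(text=True)
os.close(fd)

with open(path) as f:              # no mode or buffering arguments given
    default_mode = f.mode          # the mode the file object reports
    # In Python 3 a text-mode file wraps a buffered binary reader.
    is_buffered = isinstance(f.buffer, io.BufferedReader)

os.remove(path)
```

On Python 3 the reported mode should be 'r' and the underlying reader
should be buffered, which matches the point above.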

> open() returns a file object, which the for loop iterates over. This magic

Yes.

> iteration only works on later versions of python (ie not 1.5.2 :). When and

Indeed: only Python 2.2 and later.  In earlier versions, you could use:

for line in open('thefile').readlines():

as long as the file isn't too large to fit all in memory at once comfortably.
For potentially-huge files in old Python versions, you'd need to use less
legible and handy idioms, which is why the new one was added in 2.2.
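
To make the comparison concrete, here is a small sketch of the two
spellings side by side.  It assumes Python 3, and the file contents are
invented sample lines, not the original poster's data:

```python
import os
import tempfile

# Write a tiny made-up grammar file to iterate over.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, 'w') as out:
    out.write('S:NP VP:\nNP:Det N:\n')

# Iterating over the file object yields one line at a time.
with open(path) as f:
    lazy = [line for line in f]

# readlines() builds the whole list in memory first.
with open(path) as f:
    eager = f.readlines()

os.remove(path)
```

Both spellings see the same lines; the difference is only in how much of
the file has to be held in memory at once.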

> if the iteration does work, the name "line" refers to each line in the
> file.
>
> >     head, rewrite = line.split(':',2)[:2]
>
> "line.split(':',2)" means to create a list of up to three items from the

Yes, but I think it's easier to remember as: split up to 2 times (i.e., in
correspondence to two occurrences of colon).

> text referenced by "line", by taking all the characters up to the
> delimiting colon (":"). We need a list of three items to avoid the second
> colon being attached to the second item.

Yes, the last item holds all that follows the second colon (including
possibly more colons, we don't care).

The slightly simpler statement:

    head, rewrite = line.split(':')[:2]

works just as well in this case -- it just generates as many items as
there are colons in the line, plus one, and takes the first two as the
values to bind to names 'head' and 'rewrite' -- so, the overall effect
is just the same as the way I had written it.  It's just a habit for me
to specify the maximum number of splitter-occurrences to consider
when I know there is such a limit -- the performance advantage is tiny,
but I do tend to get into the habit of specifying things explicitly when
that has no substantial downsides, as in this case.
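
A quick sketch of the two forms on one line of input may help.  The
sample line is made up for illustration, with an extra colon in the
trailing part so the difference shows:

```python
# Hypothetical input line: head, rewrite, then trailing text with a colon.
line = 'S:NP VP: note: made up\n'

limited = line.split(':', 2)     # split at most twice -> at most 3 items
unlimited = line.split(':')      # split at every colon

head, rewrite = limited[:2]
head2, rewrite2 = unlimited[:2]
```

Here limited has three items and unlimited has four, but the first two
items, the only ones that get bound to names, are identical.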

> Having got our list of three items, we create a slice consisting of the
> first two items. This sliced list is then unpacked into two variables -
> "head" and "rewrite". You could also use:
> head, rewrite = filter(None, line.split(':',2))
> but I guess that is less efficient, otherwise Alex would have used it.

That would never occur to me -- there are no specs about skipping
empty pieces; and it would fail just about every time, e.g.:

>>> line = 'a:b:\n'
>>> head, rewrite = filter(None, line.split(':',2))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unpack list of wrong size
>>>

as we didn't give any name to use for the last piece ('\n', which isn't
filtered-out of course -- it's not the empty string).

More generally, I don't always use the most efficient idiom -- first,
I don't always KNOW which idiom is going to be most efficient for
any given case; second, "premature optimization is the root of all
evil in programming" and "do the simplest thing that can possibly
work", two principles I try to live by, both suggest picking simplicity
over (assumed) efficiency.

> >     grammardict.setdefault(head, []).append(rewrite)
>
> This searches the dictionary referred to by "grammarlist" for a list that

grammardict (simple typo, of course).

> is keyed by a string equal to "head", creating an empty list entry if it
> doesn't already exist. The "append" part then operates on the string
> referred to by the "rewrite" variable, adding it to the (possibly empty)
> list keyed by "head".

Spot-on.

> To rephrase:
> If grammardict does not include any entry keyed by "head", then an entry is
> created, consisting of an empty list. "rewrite" is then appended to that
> empty list, creating a dictionary entry of head:[rewrite].
> If grammardict does include an entry keyed by "head", then "rewrite" is
> appended to the existing list, resulting in a dictionary entry of head: [
> existing, rewrite].

Yes.  Python 1.5.2 idioms equivalent to this statement include:

if not grammardict.has_key(head):
    grammardict[head] = []
grammardict[head].append(rewrite)

Besides being more compact, the current Python idiom of setdefault
does the job of indexing grammardict just once, rather than two or
three times like the 1.5.2-compatible snippet.  Even the least
performance-obsessed programmer will often balk at doing things
two or three times when doing them once suffices;-).
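
For a side-by-side check that the two idioms build the same dictionary,
here is a minimal sketch.  It assumes Python 3, where has_key no longer
exists, so the explicit version uses "not in" instead; the (head, rewrite)
pairs are made up:

```python
# Hypothetical (head, rewrite) pairs, as if already split out of lines.
pairs = [('S', 'NP VP'), ('NP', 'Det N'), ('NP', 'N')]

# One lookup per pair: setdefault returns the existing list or a new one.
with_setdefault = {}
for head, rewrite in pairs:
    with_setdefault.setdefault(head, []).append(rewrite)

# Explicit test-then-insert, the modern spelling of the has_key idiom.
explicit = {}
for head, rewrite in pairs:
    if head not in explicit:
        explicit[head] = []
    explicit[head].append(rewrite)
```

Both dictionaries end up mapping each head to the list of its rewrites,
in the order they were seen.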

> How close is this?

Very!  The tidbit about unbuffered opening, and the filter(None ... 
strangeness, are the only two details in your commentary that I
think are erroneous -- all the rest of my commentary about your
commentary is just expanding it or mentioning side issues.
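
Putting the pieces together, the whole fragment can be sketched as a
function.  The name build_grammar and the sample lines are assumptions
for illustration; the loop body is the fragment discussed above, and any
iterable of lines (such as an open file) can be passed in:

```python
def build_grammar(lines):
    """Map each head to the list of its rewrites (hypothetical helper)."""
    grammardict = {}
    for line in lines:
        # Keep head and rewrite; drop whatever follows the second colon.
        head, rewrite = line.split(':', 2)[:2]
        grammardict.setdefault(head, []).append(rewrite)
    return grammardict

# Made-up input in the head:rewrite: format, one line with a comment.
sample = ['S:NP VP:\n', 'NP:Det N:\n', 'NP:N: # comment\n']
grammar = build_grammar(sample)
```

With a real file, the call would look like build_grammar(open('thefile')).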

Alex