writing from file into dictionary
Alex Martelli
aleax at aleax.it
Mon Nov 11 08:01:36 EST 2002
On Monday 11 November 2002 01:15 pm, Brad Hards wrote:
> On Mon, 11 Nov 2002 22:16, Alex Martelli wrote:
> <snip>
>
> > Assuming the trailing colon at the end of the 'rewrite' part
> > (and anything following it, such as the comment in the last
> > line) is to be removed; and that corresponding to each head
> > you want a list of the possible rewrites for that head;
>
> Can I try an explanation?
Sure! Sorry for not providing one earlier -- people are often happier
to get just the working-fragment rather than all the spiel behind it.
> > grammardict = {}
>
> This creates a dictionary, and provides a reference to that dictionary, which
> is "grammardict". We'll need this later.
Right. The dict object starts out empty.
> > for line in open('thefile'):
>
> open() is another name for file() in 2.2
Yes. It's the older name, so it's the one I'm used to using;-).
> When used this way, open() or file(), takes a default of opening in read
> only mode, with unbuffered I/O.
Yes, and no. Yes, read-only mode (and text-mode too, on platforms where
it matters); but no, unbuffered I/O is not the default -- that would damage
performance substantially in this default-opening case.
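
The defaults can be checked directly. The following is only a minimal sketch, assuming a current Python 3 interpreter rather than the 2.2 discussed in this thread; the temporary file is a hypothetical stand-in for 'thefile', and the names default_mode and is_buffered are made up for illustration.

```python
import io
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("S:NP VP:\n")

# Open with no mode and no buffering argument, as in the loop under discussion.
with open(path) as f:
    default_mode = f.mode  # read-only text mode
    # In Python 3 a text file wraps a buffered binary reader.
    is_buffered = isinstance(f.buffer, io.BufferedReader)

os.remove(path)
print(default_mode, is_buffered)
```

Under those assumptions the file reports read mode and sits on top of a buffered reader, which matches the point that unbuffered I/O is not the default.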
> open() returns a file object, which the for loop iterates over. This magic
Yes.
> iteration only works on later versions of python (ie not 1.5.2 :). When and
Indeed: only Python 2.2 and later. In earlier versions, you could use:
    for line in open('thefile').readlines():
as long as the file isn't too large to fit all in memory at once comfortably.
For potentially-huge files in old Python versions, you'd need to use less
legible and handy idioms, which is why the new one was added in 2.2.
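
To compare the two loops side by side, here is a small sketch, again assuming Python 3 and a hypothetical temporary file in place of 'thefile'. The list names lazy_lines and eager_lines are illustrative only.

```python
import os
import tempfile

# Hypothetical two-line grammar file standing in for 'thefile'.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("S:NP VP:\nNP:Det N:\n")

# Direct iteration: the file hands over one line per loop step.
# (Collected into a list here only so the two results can be compared.)
with open(path) as f:
    lazy_lines = [line for line in f]

# readlines(): the whole file is read into a list before the loop starts.
with open(path) as f:
    eager_lines = f.readlines()

os.remove(path)
print(lazy_lines == eager_lines)
```

Both forms see the same lines, each still carrying its trailing newline; the difference is only in how much of the file is held in memory at once.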
> if the iteration does work, the name "line" refers to each line in the
> file.
>
> > head, rewrite = line.split(':',2)[:2]
>
> "line.split(':',2)" means to create a list of up to three items from the
Yes, but I think it's easier to remember as: split up to 2 times (i.e., in
correspondence to two occurrences of colon).
> text referenced by "line", by taking all the characters up to the
> delimiting colon (":"). We need a list of three items to avoid the second
> colon being attached to the second item.
Yes, the last item holds all that follows the second colon (including
possibly more colons, we don't care).
The slightly simpler statement:
    head, rewrite = line.split(':')[:2]
works just as well in this case -- it just generates as many items as
there are colons in the line, plus one, and takes the first two as the
values to bind to names 'head' and 'rewrite' -- so, the overall effect
is just the same as the way I had written it. It's just a habit for me
to specify the maximum number of splitter-occurrences to consider
when I know there is such a limit -- the performance advantage is tiny,
but I do tend to get into the habit of specifying things explicitly when
that has no substantial downsides, as in this case.
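
A short sketch of the two spellings, assuming Python 3 and a made-up sample line (the comment text after the second colon is hypothetical, chosen so that it contains an extra colon):

```python
# Hypothetical input line with a trailing comment that itself contains a colon.
line = "S:NP VP: # a comment with a : in it\n"

limited = line.split(":", 2)  # at most 2 splits, so at most 3 items
unlimited = line.split(":")   # one item per colon, plus one

# Either way, only the first two pieces are bound to names.
head_a, rewrite_a = limited[:2]
head_b, rewrite_b = unlimited[:2]

print(len(limited), len(unlimited))
print((head_a, rewrite_a) == (head_b, rewrite_b))
```

The lists differ in length, but the slice `[:2]` throws away everything past the second piece, so the bound names come out the same.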
> Having got our list of three items, we create a slice consisting of the
> first two items. This sliced list is then unpacked into two variables -
> "head" and "rewrite". You could also use:
> head, rewrite = filter(None, line.split(':',2))
> but I guess that is less efficient, otherwise Alex would have used it.
That would never occur to me -- there are no specs about skipping
empty pieces; and it would fail just about every time, e.g.:
>>> line = 'a:b:\n'
>>> head, rewrite = filter(None, line.split(':',2))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unpack list of wrong size
>>>
as we didn't give any name to use for the last piece ('\n', which isn't
filtered-out of course -- it's not the empty string).
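
The same failure can be reproduced in Python 3, where filter returns an iterator and the exact error message differs from the 2.2 traceback above. This is a minimal sketch; the names pieces, kept and unpack_failed are illustrative only.

```python
line = "a:b:\n"

pieces = line.split(":", 2)
# filter(None, ...) drops only false values such as the empty string;
# the newline piece is a non-empty string, so it survives.
kept = list(filter(None, pieces))

try:
    head, rewrite = kept  # three items, two names
    unpack_failed = False
except ValueError:
    unpack_failed = True

print(kept, unpack_failed)
```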
More generally, I don't always use the most efficient idiom -- first,
I don't always KNOW which idiom is going to be most efficient for
any given case; second, "premature optimization is the root of all
evil in programming" and "do the simplest thing that can possibly
work", two principles I try to live by, both suggest picking simplicity
over (assumed) efficiency.
> > grammardict.setdefault(head, []).append(rewrite)
>
> This searches the dictionary referred to by "grammarlist" for a list that
grammardict (simple typo, of course).
> is keyed by a string equal to "head", creating an empty list entry if it
> doesn't already exist. The "append" part then operates on the string
> referred to by the "rewrite" variable, adding it to the (possibly empty)
> list keyed by "head".
Spot-on.
> To rephrase:
> If grammardict does not include any entry keyed by "head", then an entry is
> created, consisting of an empty list. "rewrite" is then appended to that
> empty list, creating a dictionary entry of head:[rewrite].
> If grammardict does include an entry keyed by "head", then "rewrite" is
> appended to the existing list, resulting in a dictionary entry of head: [
> existing, rewrite].
Yes. Python 1.5.2 idioms equivalent to this statement include:
    if not grammardict.has_key(head):
        grammardict[head] = []
    grammardict[head].append(rewrite)
Besides being more compact, the current Python idiom of setdefault
does the job of indexing grammardict just once, rather than two or
three times like the 1.5.2-compatible snippet. Even the least
performance-obsessed programmer will often balk at doing things
two or three times when doing them once suffices;-).
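
For readers on Python 3, where has_key no longer exists, the two approaches can be sketched as below. This is an illustration under assumptions, not the code from the original fragment: the function names and the sample lines are hypothetical, and the membership test uses the `in` operator in place of has_key.

```python
def build_with_setdefault(lines):
    """Group rewrites by head using a single setdefault call per line."""
    grammardict = {}
    for line in lines:
        head, rewrite = line.split(":", 2)[:2]
        grammardict.setdefault(head, []).append(rewrite)
    return grammardict


def build_with_membership_test(lines):
    """Same result, spelled out in the older test-then-insert style."""
    grammardict = {}
    for line in lines:
        head, rewrite = line.split(":", 2)[:2]
        if head not in grammardict:  # has_key(head) in 1.5.2
            grammardict[head] = []
        grammardict[head].append(rewrite)
    return grammardict


# Hypothetical grammar lines in the head:rewrite: format from the thread.
sample = ["S:NP VP:\n", "NP:Det N:\n", "NP:N:\n"]
result_a = build_with_setdefault(sample)
result_b = build_with_membership_test(sample)
print(result_a == result_b)
```

Both functions build the same dictionary of lists; the setdefault version simply does the lookup-or-create step in one call.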
> How close is this?
Very! The tidbit about unbuffered opening, and the filter(None ...
strangeness, are the only two details in your commentary that I
think are erroneous -- all the rest of my commentary about your
commentary is just expanding it or mentioning side issues.
Alex