ESR's fortune.pl redone in python - request for critique

Adelein and Jeremy adeleinandjeremy at yahoo.com
Tue Mar 30 23:03:46 EST 2004


Thanks to Pádraig Brady, Joakim Hove, Peter Otten, and Mel Wilson for
taking time to respond to my request. I learned quite a bit.

I have downloaded the rotagator tarball and am in the process of
investigating the source code to pick up some ideas, and will
recommend it to my friends who use the Mozilla mail client.

As for the other suggestions:

(JH)> Using the same variable name with only difference in
(JH)> capitalization I consider very bad taste. I would suggest using
(JH)> for instance the variable name 'default_fortunes_file' instead
(JH)> of FORTUNES_FILE.

I am inclined to agree with you in general, but as I am in the habit
of treating my default values as constants (all-caps), I have also
developed the habit of naming their variable counterparts with the
same words, but lowercase. My rationale is that the two names represent
exactly the same idea, only in different circumstances. Better still -
better than my habit and than your suggested change - would probably be
to use a single identifier to begin with.
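
To make the habit concrete, here is a hypothetical fragment (the path
and the command-line handling are placeholders, not lifted from my
actual script):

import sys

FORTUNES_FILE = '/usr/share/games/fortunes'   # default value, spelled as a constant

# the working value gets the same words in lowercase
fortunes_file = FORTUNES_FILE
if len(sys.argv) > 1:
    fortunes_file = sys.argv[1]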

> > fp_entry = [0]
> 
(JH)> You probably want to start with an *empty* list, not a list
(JH)> with '0' as the first element, i.e.
(JH)> 
(JH)> fp_entry = []

In fact, I did intend to have 0 as the first element. My thinking is
thus: the first entry in the fortunes file is not preceded by a '%',
and therefore will not be found by the code that follows unless I
include 0 as one of the possible fp values. I do not know Perl, and
am not an experienced programmer, so in thinking that this was the
reason for ESR's similar use, I may be proceeding from a position of
ignorance.
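
For what it's worth, here is roughly the loop I have in mind, assuming
fi is the already-opened fortunes file (an untested sketch):

fp_entry = [0]                        # offset 0: the first fortune has no '%' before it
line = fi.readline()
while line:
    if line == '%\n':
        fp_entry.append(fi.tell())    # each later fortune starts right after a '%' line
    line = fi.readline()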

> > fortune = ''
> > line = fi.readline()
> > while line != '':
> >     if re.match(r'^%$', line):
> >         break
> >     fortune += line
> >     line = fi.readline()
> > print fortune
> 
(JH)> Wouldn't it be more natural to look for '%' instead of blank
(JH)> lines:
(JH)> 
(JH)> <Untested>
(JH)> fortune = ''
(JH)> line = fi.readline()
(JH)> while line:
(JH)>     if re.match(r'^%$',line):
(JH)>         break
(JH)>     fortune += line
(JH)>     line = fi.readline()
(JH)> </Untested>

Yes, but I can't seem to get it through my thick skull that Python
was designed well enough to allow me to use datatypes sensibly as
booleans. (By the way, the code has now been tested and [predictably]
found to work.)
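
A tiny illustration of the behaviour I keep forgetting, for anyone else
as slow as I am:

for s in ['', '%\n', 'a line of fortune text\n']:
    print repr(s), '->', bool(s)   # '' is False, any non-empty string is True

# so "while line:" and "while line != '':" do the same thing here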

(JH)> I like to close the files explicitly, like you do. However it is
(JH)> not really necessary, they are closed automatically (by the
(JH)> garbage collector I guess).

I know it's not necessary, but it does no harm (right?) and I think
it's a good habit to get myself into, so that I do not try
scripting-style file processing in a language like C (when I get to
that) where it could bite me (or other users) hard.
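
In that spirit, the pattern I am trying to train myself into looks
something like this (just a sketch, with the filename as a placeholder):

fi = open(fortunes_file, 'r')
try:
    data = fi.read()
finally:
    fi.close()      # runs even if the read raises, so the handle is never left dangling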

(JH)> Repositioning the

I wish this hadn't been cut off - I assume that I was going to
receive some information on repositioning the file pointer. Is it
unnecessary to reposition the file pointer, or is it redundant (in
that it is done upon closing the file)?

> > # Let's see if we can open that file for reading
> > try:
> >     fi = open(fortunes_file, 'r')
> > except:
> 
(PO)> Make it a rule to catch something, e. g.
(PO)> 
(PO)> try:
(PO)>    ...
(PO)> except IOError:
(PO)>    ...
(PO)> 
(PO)> That reduces the chance that an unexpected error is mapped to
(PO)> something else.

I am sure many would say this is too small and simple a script to need
much exception handling beyond the basics, but I am glad you saw the
intent of my request.

So if I am understanding correctly, I want to do the following:

try:
    fi = open(fortunes_file, 'r')
except IOError:
    sys.exit("Cannot open fortunes file %s." % fortunes_file)

Out of curiosity, what else could the error be mapped to, and how
does that work, or where can I go to read up on such issues? Also, in
this particular case, where I am simply exiting the program with an
error message, could the error possibly be mapped to anything else,
and if so, would it really matter?
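
If I understand the danger correctly, a bare except could also hide a
completely unrelated mistake. A contrived example of what I mean:

import sys

try:
    fi = open(fortunes_fil, 'r')     # typo in the variable name: a NameError, not an IOError
except:
    sys.exit("Cannot open fortunes file.")   # the bare except blames the file and hides the typo

With "except IOError:" instead, the NameError would propagate and point
straight at the real bug.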

> > # Return fp to beginning of file, close the file, and exit program
> > fi.seek(0,0)
> 
(PO)> Code has no effect. I would remove it.

This has no effect because once the program exits, the file pointer
dies, or for another reason?

> 
> > fi.close()
> > sys.exit()
> 
(PO)> No need to call sys.exit() here.
(PO)> 
(PO)> My impression is that you are messing with file "pointers" too
(PO)> much which I would consider rather low-level. For C this is OK
(PO)> as it has no decent support for strings, but in Python I would
(PO)> suggest to relax and process the whole file as a string or list
(PO)> of lines.

You are correct. My intention was to perform a direct Perl to Python
translation from a slightly modified version of ESR's script. Since I
am not very well-acquainted with Perl, this alone caused me a bit of a
headache.

However, this raises new questions for me. Reading the whole file into
a data structure (whatever that structure may be) seems good for two
reasons: 1) I needn't keep the file open any longer than necessary, and
2) less time is wasted waiting on I/O to access the file. But my version
also reads the pertinent parts of the file into a data structure and then
processes those; since I am not processing the file itself, and need not
read its whole contents in, I thought the file-pointer method would be
more efficient. What do you think? My other concern is that the file this
script works on may be very large. At what point do the memory
requirements of reading such a file's entire contents into a structure
outweigh the cost of one pass to collect the possible file pointer
positions plus one more read to fetch the chosen fortune? And if that
point ever comes, isn't working with file pointer positions from the
start the more scalable approach?
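
To put the file-pointer approach in one place, here is roughly what I
mean (an untested sketch; it assumes fortunes are separated by lines
containing only '%'):

import random

def pick_fortune(fortunes_file):
    fi = open(fortunes_file, 'r')
    try:
        # first pass: record the offset at which each fortune starts
        offsets = [0]
        line = fi.readline()
        while line:
            if line == '%\n':
                offsets.append(fi.tell())
            line = fi.readline()
        # (a trailing '%' line leaves one empty entry at EOF; ignored here for brevity)

        # second pass: jump to one starting offset and read up to the next '%'
        fi.seek(random.choice(offsets))
        fortune = ''
        line = fi.readline()
        while line and line != '%\n':
            fortune += line
            line = fi.readline()
        return fortune
    finally:
        fi.close()

Only the list of offsets ever sits in memory, not the fortunes
themselves.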
 
(MW)> Or I could see (untested code)
(MW)> 
(MW)> import random, sys
(MW)> 
(MW)> def fortunes (infile):
(MW)>     return infile.read().split ('\n%\n')
(MW)> 
(MW)> def findfortune (filename):
(MW)>     return random.choice (fortunes (file (filename, 'rt')))
(MW)> 
(MW)> if __name__ == '__main__':
(MW)>     print findfortune (sys.argv[1])

This is impressive for its brevity, but my question about the memory
efficiency of storing the entire contents of a large file applies here
too. Also, what is the 't' in the 'rt' mode passed to file()? I looked
up file() in the library reference and it doesn't mention t - only a, b,
r, w, +, and U. A typo, or something I am unaware of?

(PO)> That's what I meant with "in Python I would suggest to relax
(PO)> and process the whole file as a string or list of lines" in my
(PO)> above post.

(PO)> Here's a different approach that reads fortunes one at a time
(PO)> with a generator function that collects lines until it
(PO)> encounters a "%\n". I'm using an algorithm to choose a random
(PO)> fortune that was posted on c.l.py by Paul Rubin....
(PO)>
(PO)> import random, sys
(PO)> 
(PO)> def fortunes(infile):
(PO)>     """ Yield fortunes as lists of lines """
(PO)>     result = []
(PO)>     for line in infile:
(PO)>         if line == "%\n":
(PO)>             yield result
(PO)>             result = []
(PO)>         else:
(PO)>             result.append(line)
(PO)>     if result:
(PO)>         yield result
(PO)> 
(PO)> def findfortune(filename):
(PO)>     """ Pick a random fortune from a file """
(PO)>     for index, fortune in enumerate(fortunes(file(filename))):
(PO)>         if random.random() < (1.0 / (index+1)):
(PO)>             chosen = fortune
(PO)> 
(PO)>     return "".join(chosen)
(PO)> 
(PO)> if __name__ == "__main__":
(PO)>     print findfortune(sys.argv[1]),

The randomization in findfortune() is interesting and would never have
occurred to me. The problem, as I see it (and I could be wrong), is that
it yields a less-than-pseudorandom choice: the randomizer is applied not
to the set as a whole but to each element in turn, so I can easily
imagine entries towards the end of a large file being much less likely
to be chosen. And if a positive for any element means that none of the
following elements are considered, some entries could effectively never
be chosen at all.

(PO)> The point of my solution is to never have (a) the whole file in
(PO)> memory and (b) to make a random choice from a set of (initially)
(PO)> unknown size while you go through it.

This is a very interesting solution in light of point (b), but of
course I will always know the size of the set as long as it comes
from a file. As for point (a), in the case that the last entry of a
file is chosen, wouldn't the whole file then have been in memory at
the same time, or do values returned by generators die right after
their use? In addition, simply opening the file puts its entire
contents into memory, does it not? (So this runs counter to what I
said about I/O access times for opened files - please set me straight
one way or the other; are file objects created with their contents
loaded into memory, or do their contents remain in storage until
needed?)

In any case, the idea of avoiding the storage of the file's contents
in memory is (as I understand it) the reason for using a list of file
positions.

Thank you very much for your feedback. I am learning...

- Jeremy
adeleinandjeremy at yahoo.com
