ESR's fortune.pl redone in python - request for critique

Wed Mar 31 08:04:52 EST 2004

Adelein and Jeremy wrote:

> (JH)> Using the same variable name with only difference in
> (JH)> capitalization I consider very bad taste. I would suggest using
> (JH)> for instance the variable name 'default_fortunes_file' instead
> (JH)> of FORTUNES_FILE.
> 
> I am inclined to agree with you in general, but as I am in the habit
> of treating my default values as constants (all-caps), I have also
> developed the habit of naming their variable counterparts with the
> same words, but lowercase. My rationale was that the two names
> represent exactly the same idea, but in different circumstances. I am
> of the opinion that even better than this, and better than your
> suggested change, is to just use the same identifier to begin with.

I'm with you here.

> 
>> > fp_entry = [0]
>> 
> (JH)> You probably want to start with an *empty* list, not a list
> (JH)> with '0' as the first element, i.e.
> (JH)>
> (JH)> fp_entry = []
> 
> In fact, I did intend to have 0 as the first element. My thinking is
> thus: the first entry in the fortunes file is not preceded by a '%',
> and therefore will not be found by the code that follows unless I
> include 0 as one of the possible fp values. I do not know perl, and
> am not an experienced programmer, so in thinking that this was the
> reason for ESR's similar use, I may be proceeding from a position of
> ignorance.

I think you're right again.

> (PO)> Make it a rule to catch something, e. g.
> (PO)>
> (PO)> try:
> (PO)>    ...
> (PO)> except IOError:
> (PO)>    ...
> (PO)>
> (PO)> That reduces the chance that an unexpected error is mapped to
> (PO)> something else.
> 
> I am sure that many would say that this is too simple and little a
> script to be doing much exception handling beyond basics, but I am
> glad you saw the intent of my request.
> 
> So if I am understanding correctly, I want to do the following:
> 
> try:
>     fi = open(fortunes_file, 'r')
> except IOError:
>     sys.exit("Cannot open fortunes file %s." % fortunes_file)
> 
> Out of curiosity, what else could the error be mapped to, and how
> does that work, or where can I go to read up on such issues? Also, in
> this particular case, where I am simply exiting the program with an
> error message, could the error possibly be mapped to anything else,
> and if so, would it really matter?

A wrong error message may be worse than no error message. Consider

try:
   fi = open(fortune_file)
except:
    sys.exit("Cannot open fortunes file %s." % fortunes_file)

That might mislead you to look for problems with the file while it's just an
ordinary spelling error.

>> > # Return fp to beginning of file, close the file, and exit
>> program
>> > fi.seek(0,0)
>> 
> (PO)> Code has no effect. I would remove it.
> 
> This has no effect because once the program exits, the file pointer
> dies, or for another reason?

You can think of seek() for a readonly file as positioning a pointer to the
start of the next read() operation. As no more reads will follow, such a
repositioning is useless.

>> > fi.close()
>> > sys.exit()
>> 
> (PO)> No need to call sys.exit() here.
> (PO)>
> (PO)> My impression is that you are messing with file "pointers" too
> (PO)> much which I would consider rather low-level. For C this is OK
> (PO)> as it has no decent support for strings, but in Python I would
> (PO)> suggest to relax and process the whole file as a string or list
> (PO)> of lines.
> 
> You are correct. My intention was to perform a direct Perl to Python
> translation from a slightly modified version of ESR's script. Since I
> am not very well-acquainted with Perl, this alone caused me a bit of
> headache.
> 
> However, this raises new questions for me. Reading the file into a
> data structure (whatever that structure may be) seems to be good for
> two reasons: 1) I needn't keep the file open for longer than
> necessary, and 2) less time is wasted waiting for I/O to access the
> file. 

As you can see from Mel Wilson's code, 
3) it also greatly simplifies your program, which means it will be less
errorprone and take less time to write and less time to understand for
others. Also, I expect it to be the fastest for small files - and the size
so called "small" files is still growing fast these days.

> But the way I have it also reads the pertinent parts of the
> file into a data structure, and then processes those. Because I am
> not processing the file, and because I need not read the whole file
> contents in, I thought that this method of using file pointers would
> be more efficient. What do you think? Also, my thinking is that the
> file that this script works on may be very large. At what point do
> the memory requirements of reading the contents of all of those files
> into a structure outweigh the read I/O access times for retrieving
> all relevant possible file pointer positions and then reading once
> more to find that one fp position, and if it ever does, is not a more
> scalable approach to use file pointer positions from the start?

Yes, Mel's approach is the least scalable (it keeps the whole file),
followed by yours (keeps a list of positions for every items). Mine is best
in that regard :-), because it only keeps two items (reducing that to one
is left as an exercise...) at any time.
But if you really need performance and many entries, I'd guess that putting
the items into a database would defeat them all.

> (MW)> Or I could see (untested code)
> (MW)>
> (MW)> import random, sys
> (MW)>
> (MW)> def fortunes (infile):
> (MW)>     return infile.read().split ('\n%\n')
> (MW)>
> (MW)> def findfortune (filename):
> (MW)>     return random.choice (fortunes (file (filename, 'rt'))
> (MW)>
> (MW)> if __name__ == '__main__':
> (MW)>     print findfortune (sys.argv[1])
> 
> This is impressive for its brevity, but my question about the memory
> efficiency of storing the entire contents of large files applies.
> Also, what is 'tr' in the arguments list for file()? I looked up
> file() in the library reference and it doesn't mention t - only a, b,
> r, w, +, and U. A typo, or something I am unaware of?

t translates "\r\n" to "\n" on windows. I think this is superseded by "U"
these days, which guarantees "\n" as newline on windows, unix and mac
alike.

> (PO)> That's what I meant with "in Python I would suggest to relax
> (PO)> and process the whole file as a string or list of lines" in my
> (PO)> above post.
> 
> (PO)> Here's a different approach that reads fortunes one at a time
> (PO)> with a generator function that collects lines until it
> (PO)> encounters a "%\n". I'm using an algorithm to choose a random
> (PO)> fortune that was posted on c.l.py by Paul Rubin....
> (PO)>
> (PO)> import random, sys
> (PO)>
> (PO)> def fortunes(infile):
> (PO)>     """ Yield fortunes as lists of lines """
> (PO)>     result = []
> (PO)>     for line in infile:
> (PO)>         if line == "%\n":
> (PO)>             yield result
> (PO)>             result = []
> (PO)>         else:
> (PO)>             result.append(line)
> (PO)>     if result:
> (PO)>         yield result
> (PO)>
> (PO)> def findfortune(filename):
> (PO)>     """ Pick a random fortune from a file """
> (PO)>     for index, fortune in enumerate(fortunes(file(filename))):
> (PO)>         if random.random() < (1.0 / (index+1)):
> (PO)>             chosen = fortune
> (PO)>
> (PO)>     return "".join(chosen)
> (PO)>
> (PO)> if __name__ == "__main__":
> (PO)>     print findfortune(sys.argv[1]),
> 
> The randomization in findfortune() is interesting and would never
> have occurred to me. The problem with it (as I see it - I could be
> wrong) is that it results in a sub-pseudo random choice, as I can
> easily imagine that entries towards the end of a large file will be
> much less likely to be chosen due to the randomizer being applied not
> to a set but to each element of the set, and because a positive for
> any element means that none of the following elements are considered
> for randomization, the result could well be that some entries will
> effectively never be chosen.

You are getting this wrong. I hoped to evade the explanation by providing
the reference, but now I'll give it a try. You can prove the right outcome
by induction, but I'll resort to "proof by example" for now.
Let's consider a small sample 

for index, item in enumerate(["a", "b", "c", "d"]):
    if random() < (1.0/(index+1)):
        chosen = item

random() yields a floating point number 0 <= n < 1.0

Now let's unroll the loop:

if random() <1.0/1 : chosen = "a" #index is 0, item is "a"
if random() <1.0/2 : chosen = "b" #index is 1, item is "b"
if random() <1.0/3 : chosen = "c" #index is 1, item is "c"
if random() <1.0/4 : chosen = "d" #index is 1, item is "d"

The probability for the if branches to be executed is thus
decreasing: 1, 0.5, 0.33, 0.25.
Now look at it backwards: The chance that d is chosen is 0.25, the total
probability for a, b, or c is then 1-0.25 = 0.75. When we now look at the
first three lines only, we see immediately that the chance for c is one
third. 1/3 of 75 percent is again 0.25. The remaining probability for a and
b is then p(a or b) = 1 - p(c) - p(d) = 0.5. Now look at the first two
lines only to distribute the total probability of 0.5 over a and b. Do you
see the pattern?

> 
> (PO)> The point of my solution is to never have (a) the whole file in
> (PO)> memory and (b) to make a random choice from a set of
> (initially)
> (PO)> unknown size while you go through it.
> 
> This is a very interesting solution in light of point (b), but of
> course I will always know the size of the set as long as it comes
> from a file. 

Again, you have to build a list of positions, with size proportional to the
file, I only keep the last fortune item, essentially constant size.

> As for point (a), in the case that the last entry of a
> file is chosen, wouldn't the whole file then have been in memory at
> the same time, or do values returned by generators die right after
> their use? 

That is up to the garbage collection; I'm working with the assumption that
unreachable data, i. e. previously read lines are garbage-collected in a
timely manner.

> In addition, simply opening the file puts its entire
> contents into memory, does it not? (So this runs counter to what I
> said about I/O access times for opened files - please set me straight
> one way or the other; are file objects created with their contents
> loaded into memory, or do their contents remain in storage until
> needed?)

The operating system and Python may both perform some caching as they see
fit, but conceptually (and practically at least for large files - they can
still be larger than your RAM after all) they are *not* read into memory.

Peter