ESR's fortune.pl redone in python - request for critique

Wed Mar 31 15:17:32 EST 2004

Thanks again for all of the feedback.

(JH)> My second comment was that it is not necessary to explicitly
(JH)> exit the program with 'sys.exit()'.

I was aware of this, but it is a bad habit of mine (philosopher's
disease?) to state the obvious even if it has already been stated
elsewhere (when the interpreter terminates the process and exits,
right?). sys.exit() is only necessary for properly terminating event
loops, is that correct?
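
For reference, a minimal sketch of what sys.exit() does (the helper names
here are made up for illustration). It works by raising SystemExit, so it
is useful for leaving early from nested code or for setting an exit
status, and it adds nothing on the last line of a script.

```python
import sys

def exit_code_of(func):
    """Run func and return the code of any SystemExit it raises, else None."""
    try:
        func()
    except SystemExit as exc:
        return exc.code
    return None

def main_with_exit():
    sys.exit(0)                    # explicit exit with status 0

def main_with_message():
    sys.exit("Cannot continue")    # uncaught, this prints to stderr, status 1

def main_plain():
    pass                           # just returns; no SystemExit is raised

print(exit_code_of(main_with_exit))
print(exit_code_of(main_with_message))
print(exit_code_of(main_plain))
```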

(JH)> I am impressed with the way you followed up on this.

I am not quite sure what you mean by this (how else would I have done
so?), but thank you.

>> In fact, I did intend to have 0 as the first element. My thinking
>> is thus: the first entry in the fortunes file is not preceded by a
>> '%', and therefore will not be found by the code that follows
>> unless I include 0 as one of the possible fp values. I do not know
>> perl, and am not an experienced programmer, so in thinking that
>> this was the reason for ESR's similar use, I may be proceeding
>> from a position of ignorance.
>  
(PO)> I think you're right again.

I am right about the first element needing to be zero, xor that I am
proceeding from a position of ignorance? :-)
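
To make the reasoning about the leading 0 concrete, here is a rough
sketch of the offset-list idea. It is not the code from the thread: the
function names and the in-memory sample are assumptions, and it assumes
entries are separated by lines containing only '%'.

```python
import io
import random

def fortune_offsets(fp):
    """Return the byte offset at which each fortune starts (binary file object)."""
    offsets = [0]                  # the first fortune has no '%' line before it
    fp.seek(0)
    while True:
        line = fp.readline()
        if not line:
            break
        if line.strip() == b"%":
            offsets.append(fp.tell())   # next fortune starts after the separator
    return offsets

def read_fortune(fp, start):
    """Read from start up to the next '%' line or the end of the file."""
    fp.seek(start)
    lines = []
    for line in iter(fp.readline, b""):
        if line.strip() == b"%":
            break
        lines.append(line)
    return b"".join(lines).decode("ascii")

sample = io.BytesIO(b"first\n%\nsecond\n%\nthird\n")
offsets = fortune_offsets(sample)
print(read_fortune(sample, random.choice(offsets)))
```

Without the initial 0 in the list, "first" could never be selected. A
file that ends with a trailing '%' line would add an empty last entry,
which this sketch does not handle.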

(PO)> A wrong error message may be worse than no error message.
(PO)> Consider
(PO)> 
(PO)> try:
(PO)>    fi = open(fortune_file)
(PO)> except:
(PO)>     sys.exit("Cannot open fortunes file %s." % fortunes_file)
(PO)> 
(PO)> That might mislead you to look for problems with the file while
(PO)> it's just an ordinary spelling error.

I see your point, I think. As I understand it, I should simply exit
upon an IOError exception, with no error message at all, so that a
user is not led to believe that the file may be corrupt when in fact
it may not exist, or the filename passed was misspelled, etc. This is
a good point to keep in mind for all error messages; I will add it to
my thinking.
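
Another possible reading of the point, sketched under my own
assumptions: catch only the I/O failure and pass along the system's own
description, so that a misspelled variable name surfaces as a NameError
instead of being reported as a file problem. The function name
open_fortunes is hypothetical, and OSError is the current spelling of
what Python 2.3 calls IOError.

```python
import sys

def open_fortunes(path):
    """Open the fortunes file; on an I/O failure, exit with the OS's own reason."""
    try:
        return open(path)
    except OSError as err:         # IOError in the Python 2.3 of this thread
        # err.strerror is the system's text, e.g. that the file does not exist
        sys.exit("Cannot open fortunes file %s: %s" % (path, err.strerror))
```

Because the except clause names one exception type, any other mistake
inside the try block still produces a normal traceback.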

(PO)> You can think of seek() for a readonly file as positioning a
(PO)> pointer to the start of the next read() operation. As no more
(PO)> reads will follow, such a repositioning is useless.

...And therefore inefficient....

(JH)> Repositioning the file pointer is not necessary, this pointer
(JH)> just vanishes when you close the file anyway.

...So I correctly understood the comments on this subject. Thank you.

(PO)> As you can see from Mel Wilson's code, 
(PO)> 3) it also greatly simplifies your program, which means it will
(PO)> be less errorprone and take less time to write and less time to
(PO)> understand for others. Also, I expect it to be the fastest for
(PO)> small files - and the size of so-called "small" files is still
(PO)> growing fast these days.
(PO)> 
(PO)> Yes, Mel's approach is the least scalable (it keeps the whole
(PO)> file), followed by yours (keeps a list of positions for every
(PO)> item). Mine is best in that regard :-), because it only keeps
(PO)> two items (reducing that to one is left as an exercise...) at
(PO)> any time. But if you really need performance and many entries,
(PO)> I'd guess that putting the items into a database would defeat
(PO)> them all.

So, Mel's version for small files (which applies 99% of the time to
fortune files), mine to show how Perl transforms into Python
statement-by-statement, and yours (once I get a handle on it) for
'production' use.

> Also, what is 'tr' in the arguments list for file()? I looked up
> file() in the library reference and it doesn't mention t - only a,
> b, r, w, +, and U. A typo, or something I am unaware of?
(PO)>
(PO)> t translates "\r\n" to "\n" on windows. I think this is
(PO)> superseded by "U" these days, which guarantees "\n" as newline
(PO)> on windows, unix and mac alike.

I thought that the conversion of newline/carriage return sequences
was automatic as long as files are not read in under binary mode. Is
that not the case? I still find no mention of t in the ref
(http://python.org/doc/2.3.3/lib/built-in-funcs.html#l2h-25). Is it
platform dependent, or am I looking at the wrong docref?
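
A small experiment one could run to see the translation in action. Note
the assumption: this is written for Python 3, where text mode translates
all three line-ending styles on read by default; it does not settle what
the 2.3.3 reference intends for 't'.

```python
import os
import tempfile

# Write raw bytes containing three different line endings.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"unix\nwindows\r\nold mac\r")

with open(path, "r") as text:      # text mode: line endings are translated
    as_text = text.read()
with open(path, "rb") as raw:      # binary mode: bytes come back untouched
    as_bytes = raw.read()
os.remove(path)

print(repr(as_text))
print(repr(as_bytes))
```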

> The randomization in findfortune() is interesting and would never
> have occurred to me. The problem with it (as I see it - I could be
> wrong) is that it results in a sub-pseudo random choice, as I can
> easily imagine that entries towards the end of a large file will be
> much less likely to be chosen due to the randomizer being applied
> not to a set but to each element of the set, and because a positive
> for any element means that none of the following elements are
> considered for randomization, the result could well be that some
> entries will effectively never be chosen.
(PO)>
(PO)> You are getting this wrong. I hoped to evade the explanation by
(PO)> providing the reference, but now I'll give it a try. You can
(PO)> prove the right outcome by induction, but I'll resort to "proof
(PO)> by example" for now. Let's consider a small sample
(PO)>
(PO)> for index, item in enumerate(["a", "b", "c", "d"]):
(PO)>     if random() < (1.0/(index+1)):
(PO)>         chosen = item
(PO)> 
(PO)> random() yields a floating point number 0 <= n < 1.0
(PO)> 
(PO)> Now let's unroll the loop:
(PO)> 
(PO)> if random() <1.0/1 : chosen = "a" #index is 0, item is "a"
(PO)> if random() <1.0/2 : chosen = "b" #index is 1, item is "b"
(PO)> if random() <1.0/3 : chosen = "c" #index is 2, item is "c"
(PO)> if random() <1.0/4 : chosen = "d" #index is 3, item is "d"
(PO)> 
(PO)> The probability for the if branches to be executed is thus
(PO)> decreasing: 1, 0.5, 0.33, 0.25. Now look at it backwards: The
(PO)> chance that d is chosen is 0.25, the total probability for a,
(PO)> b, or c is then 1-0.25 = 0.75. When we now look at the first
(PO)> three lines only, we see immediately that the chance for c is
(PO)> one third. 1/3 of 75 percent is again 0.25. The remaining
(PO)> probability for a and b is then p(a or b) = 1 - p(c) - p(d) =
(PO)> 0.5. Now look at the first two lines only to distribute the
(PO)> total probability of 0.5 over a and b. Do you see the pattern?

Yes, and I am sorry about that - I did not understand your example as
well as I thought I did, and I totally missed the link you included -
lucky I didn't delete that email. What you are saying is
crystal-clear to me, but I need to really go over it to comprehend it
fully. In any case, I see that the probabilities do not work out as I
thought.
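
To go over it properly, the pattern can be checked with exact
fractions. This is only a sketch (the function name is my own): item i
survives if it is picked at its own step, with probability 1/(i+1), and
is not replaced at any later step j, each with probability 1 - 1/(j+1).

```python
from fractions import Fraction

def exact_probabilities(n):
    """Probability that each of n items ends up chosen by the 1/(index+1) rule."""
    probs = []
    for i in range(n):
        p = Fraction(1, i + 1)                # item i is picked at its own step
        for j in range(i + 1, n):
            p *= 1 - Fraction(1, j + 1)       # and no later item replaces it
        probs.append(p)
    return probs

print(exact_probabilities(4))   # every entry is Fraction(1, 4)
```

For item "a" in the four-item example this is 1 * 1/2 * 2/3 * 3/4 = 1/4,
and the same telescoping product gives 1/n for every position.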

> This is a very interesting solution in light of point (b), but of
> course I will always know the size of the set as long as it comes
> from a file. 
(PO)> 
(PO)> Again, you have to build a list of positions, with size
(PO)> proportional to the file, I only keep the last fortune item,
(PO)> essentially constant size.

Point noted. I had missed it, obviously.

> As for point (a), in the case that the last entry of a
> file is chosen, wouldn't the whole file then have been in memory at
> the same time, or do values returned by generators die right after
> their use? 
(PO)> 
(PO)> That is up to the garbage collection; I'm working with the
(PO)> assumption that unreachable data, i. e. previously read lines
(PO)> are garbage-collected in a timely manner.

So in theory, they die. I am a complete newb as regards generators
(and many, many other things) - I am incapable of explaining them to
those who do not know what they are. Now I am one step closer to
being able to do so.
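
As an exercise in understanding, here is how I picture the generator
approach. This is a sketch and not PO's actual code: the names fortunes
and pick are assumptions, as is the '%'-only separator line. The
generator hands over one fortune at a time, and the selection loop keeps
a reference to at most one of them.

```python
import io
import random

def fortunes(lines):
    """Yield one fortune at a time from an iterable of lines split by '%' lines."""
    current = []
    for line in lines:
        if line.strip() == "%":
            yield "".join(current)
            current = []           # the generator drops its reference here
        else:
            current.append(line)
    if current:
        yield "".join(current)

def pick(iterable, rng=random):
    """Replace the kept item with probability 1/(index+1) at each step."""
    chosen = None
    for index, item in enumerate(iterable):
        if rng.random() < 1.0 / (index + 1):
            chosen = item          # the old value of chosen becomes unreachable
    return chosen

sample = io.StringIO("first\n%\nsecond\n%\nthird\n")
print(pick(fortunes(sample)))
```

Any fortune that is neither the current candidate nor the one being
built is unreachable, so it can be reclaimed regardless of how long the
file is.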

(PO)> The operating system and Python may both perform some caching as
(PO)> they see fit, but conceptually (and practically at least for
(PO)> large files - they can still be larger than your RAM after all)
(PO)> they are *not* read into memory.

So ``creating a file object'' is not the best terminology? (Creating
other objects puts all of their attributes in a defined memory
space.) Instead, I am creating an object with access to a specific
file (and, of course, with methods to work on that file)?

- Jeremy
adeleinandjeremy at yahoo.com
