[Tutor] comparing files

Wed Sep 15 22:08:10 CEST 2004

D Elliott said unto the world upon 2004-09-15 14:52:
> I am completely new to programming and have been learning Python for about
> a week. I have looked through and worked through the first few chapters
> of:
> 
> - Python Tutorial (Rossum et al)
> - Non-Programmers Tutorial for Python (Cogliati)
> - Learn to program using Python (Gauld)
> - How to think like a computer scientist (Downey et al)
> 
> For my PhD in machine translation evaluation, my first programming task is
> to try to automatically detect (and then count) all words that were not
> translated (into English) by the system (ie. they are still in French).
> The idea I have is as follows:
> 
> - Read in a file containing MT output (usually about 400 words)
> - Compare it with a file containing a complete English word list
> - Print all words that do not appear in the wordlist in a separate file
> - Count the words in the file and print the percentage of not found words
> (The assumption is that these will be untranslated words - obviously this
> will have to be tested and tweaked)
> 
> I now know how to read and write files, but not compare them. Would you
> say this is a particularly advanced task to do? My supervisor seemed to
> think that I could learn how to do this within a week by just skimming
> through the books and finding the relevant code. Is this realistic for a
> complete beginner? I, on the other hand, prefer to fully understand what I
> am doing! (BTW - my supervisor does not know Python)
> 
> Could anyone please tell me how long you think it should take a keen
> beginner to get to that level, and which aspects of Python would you
> recommend that I learn first? Does anyone know of a book/tutorial that
> shows how to do the above tasks?
> 
> Thanks in advance to anyone who can enlighten me:)
> Debbie

Hi Debbie,

I'm learning Python as a hobby and distraction from my thesis in 
Philosophy, so I'm no expert. But I'd be surprised if anyone in a comp sci 
related PhD program would take a week to learn enough Python to do what 
you describe. It took me no more than a few full days' worth of effort 
(albeit spread over a few weeks) to be confident in doing similar tasks.

Some general learning advice:

I started with How to think like a computer scientist. Finding it pitched 
a bit low (I believe it is aimed at high school students), I moved on to 
Lutz and Ascher's Learning Python 
<http://www.oreilly.com/catalog/lpython2/>. It is likely worth a purchase, 
though, depending on your uni's arrangements, you might be able to read it 
online for free through safari <http://safari.oreilly.com/>.

Also, if you intend to use Python in anger, I'd suggest buying Martelli 
Python in a Nutshell <http://www.oreilly.com/catalog/pythonian/>. It's not 
so much for learning (at least early on) as it is a very useful memory 
jogger. I have, though, used it to learn how to do a number of things, 
too. This one, safari or no, you will want to have at hand.

And, since you posted to the Tutor list, you found one of the very best 
resources already!

I think you'd likely learn more skimming through the docs and trying to 
build it from scratch than you would skimming through books looking for 
code to use. Perhaps more knowledgeable folks will disagree, but at the 
early stages, learning how to do it from scratch seems better to me, even 
though it does overlook the great strength of the open source community: 
that you get to stand on the shoulders of giants.

Advice about your task:

I'm going to make the simplifying assumptions a) that there are only ASCII 
characters at play and b) no words in your MT output file have line-ending 
hyphenations.

What I would do as a first approach to this would (in broad outline) be:

1) read both the MT output and the reference word files into strings, 
using the .read() method of a file object. This will give you two strings, 
each of which is the contents of one of the original files. Then,

2) Split each file contents string at whitespace to separate them into 
words (assumption (b) kicking in here), using the .split() method of the 
string object. This will give you two lists, each of all the words in the 
original files. (You might also use .lower() on the original strings to 
discard case differences.) Then,

3) For each element in the MT output word list, check whether it is in the 
reference word list. That will need a for loop and the in keyword. Using 
an if test, increment the appropriate counters as you go.
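
Steps 1) to 3) can be sketched roughly as follows. This is only a minimal 
illustration written for a current Python, not tested on real MT data; the 
function names (read_text, find_unknown_words) and the sample strings are 
made up for the example, and it relies on assumptions (a) and (b) above.

```python
def read_text(path):
    # Step 1: read a whole file into one string with .read().
    with open(path) as f:
        return f.read()


def find_unknown_words(mt_text, wordlist_text):
    """Return (unknown_words, total_word_count) for the MT output text."""
    # Step 2: lower-case both strings and split them at whitespace.
    mt_words = mt_text.lower().split()
    known_words = wordlist_text.lower().split()

    # Step 3: loop over the MT words and test membership with `in`.
    unknown = []
    for word in mt_words:
        if word not in known_words:
            unknown.append(word)
    return unknown, len(mt_words)


# Tiny in-memory example, so the sketch runs without any files.
# With real data you would pass read_text(...) results instead.
sample_mt = "The chat sat on the tapis"
sample_wordlist = "the cat sat on mat"

unknown, total = find_unknown_words(sample_mt, sample_wordlist)
percent_unknown = 100.0 * len(unknown) / total
```

Writing the unknown words to a separate file is then a matter of opening 
an output file for writing and calling .write() on it for each word. Note 
that splitting at whitespace leaves punctuation attached to words, so 
"tapis." would not match "tapis"; that would need extra handling.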

I can think of several ways you could speed this up, and there are surely 
a good many more that I haven't seen. For instance, once you 
get something like that going, you might think about breaking the standard 
word list up into sub-lists, one for words that start with 'a', etc. (This 
would reduce how many comparisons you have to make for each word.) You 
might also look to serialize (or store) those canonical word lists to save 
the step of constructing them each time. But, once you've got it done as I 
outline above, you should be well on your way to knowing how to improve it 
in these or other ways.
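
As a rough sketch of those two ideas (sub-lists keyed by first letter, and 
storing the result for later runs), something like the following might do. 
The names build_buckets and is_known are invented for the illustration, 
and using the standard pickle module for storage is my assumption about 
one reasonable way to serialize.

```python
import pickle


def build_buckets(words):
    """Group words into sub-lists keyed by their first letter."""
    buckets = {}
    for word in words:
        # Words produced by .split() are never empty, so word[0] is safe.
        buckets.setdefault(word[0], []).append(word)
    return buckets


def is_known(word, buckets):
    # Only the sub-list for this word's first letter is searched.
    return word in buckets.get(word[0], [])


buckets = build_buckets("the cat sat on mat".split())

# Serialize to bytes and back again. pickle.dump() and pickle.load()
# do the same job with a file object opened in binary mode.
restored = pickle.loads(pickle.dumps(buckets))
```

Python's built-in set type is another option worth reading about: putting 
the reference words into a set makes each membership test fast without any 
hand-made sub-lists.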

Doc pages that you will find helpful -- these are also in your Python 
installation on (many?/all?) platforms:

http://www.python.org/doc/2.3.4/lib/string-methods.html
http://www.python.org/doc/2.3.4/lib/typesseq-mutable.html
http://www.python.org/doc/2.3.4/lib/built-in-funcs.html

These will cover various methods that you will find useful. I'd suggest 
looking through them briefly so you get a general 'lay of the land' and 
then consulting in detail if/as the need arises.

Above all, though: remember this advice is from a relative newcomer!

Good luck and best,

Brian vdB