[Tutor] comparing files
Brian van den Broek
bvande at po-box.mcgill.ca
Wed Sep 15 22:08:10 CEST 2004
D Elliott said unto the world upon 2004-09-15 14:52:
> I am completely new to programming and have been learning Python for about
> a week. I have looked through and worked through the first few chapters
> of:
>
> - Python Tutorial (Rossum et al)
> - Non-Programmers Tutorial for Python (Cogliati)
> - Learn to program using Python (Gauld)
> - How to think like a computer scientist (Downey et al)
>
> For my PhD in machine translation evaluation, my first programming task is
> to try to automatically detect (and then count) all words that were not
> translated (into English) by the system (ie. they are still in French).
> The idea I have is as follows:
>
> - Read in a file containing MT output (usually about 400 words)
> - Compare it with a file containing a complete English word list
> - Print all words that do not appear in the wordlist in a separate file
> - Count the words in the file and print the percentage of not found words
> (The assumption is that these will be untranslated words - obviously this
> will have to be tested and tweaked)
>
> I now know how to read and write files, but not compare them. Would you
> say this is a particularly advanced task to do? My supervisor seemed to
> think that I could learn how to do this within a week by just skimming
> through the books and finding the relevant code. Is this realistic for a
> complete beginner? I, on the other hand, prefer to fully understand what I
> am doing! (BTW - my supervisor does not know Python)
>
> Could anyone please tell me how long you think it should take a keen
> beginner to get to that level, and which aspects of Python would you
> recommend that I learn first? Does anyone know of a book/tutorial that
> shows how to do the above tasks?
>
> Thanks in advance to anyone who can enlighten me:)
> Debbie
Hi Debbie,
I'm learning Python as a hobby and distraction from my thesis in
Philosophy, so I'm no expert. But I'd be surprised if anyone in a comp sci
related PhD program needed a full week to learn enough Python to do what
you describe. It took me less than a few full days' worth of effort
(albeit spread over a few weeks) to be confident in doing similar tasks.
Some general learning advice:
I started with How to think like a computer scientist. Finding it pitched
a bit low (I believe it is aimed at high school students), I moved on to
Lutz and Ascher's Learning Python
<http://www.oreilly.com/catalog/lpython2/>. It is likely worth a purchase.
Though, depending on your uni's arrangements, you might be able to read it
online for free through Safari <http://safari.oreilly.com/>.
Also, if you intend to use Python in anger, I'd suggest buying Martelli's
Python in a Nutshell <http://www.oreilly.com/catalog/pythonian/>. It's not
so much for learning (at least early on) as it is a very useful memory
jogger. I have, though, used it to learn how to do a number of things,
too. This one, Safari or no, you will want to have at hand.
And, since you posted to the Tutor list, you've already found one of the
very best resources!
I think you'd likely learn more skimming through the docs and trying to
build it from scratch than you would skimming through books looking for
code to use. Perhaps more knowledgeable folks will disagree, but at the
early stages, learning how to do it from scratch seems better to me, even
though it does overlook the great strength of the open source community:
that you get to stand on the shoulders of giants.
Advice about your task:
I'm going to make the simplifying assumptions a) that there are only ASCII
characters at play and b) no words in your MT output file have line-ending
hyphenations.
What I would do as a first approach to this would (in broad outline) be:
1) read both the MT output and the reference word files into strings,
using the .read() method of a file object. This will give you two strings,
each of which is the contents of one of the original files. Then,
2) split each file-contents string at whitespace to separate it into
words (assumption (b) kicking in here), using the .split() method of the
string object. This will give you two lists, each holding all the words
of one of the original files. (You might also use .lower() on the original
strings to discard case differences.) Then,
3) for each element in the MT output word list, check if it is in the
reference word list. That will need a for loop and the in keyword. Using
an if test, increment the appropriate counters as you go.
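
As a rough sketch of steps 1) to 3), and nothing more definitive than that:
the function names below are hypothetical ones I made up for illustration,
and the code assumes a current Python 3 rather than the 2.3 series the doc
links in this message point to.

```python
def load_words(path):
    # Steps 1 and 2: read the whole file into one string, lower-case it
    # to discard case differences, and split it at whitespace.
    with open(path) as f:
        return f.read().lower().split()


def find_unknown_words(mt_words, reference_words):
    # Step 3: a for loop plus the `in` keyword, collecting every MT
    # word that does not appear in the reference word list.
    unknown = []
    for word in mt_words:
        if word not in reference_words:
            unknown.append(word)
    return unknown


def percent_unknown(mt_words, unknown):
    # Percentage of MT words not found; an empty MT file gives 0.0
    # rather than a division by zero.
    if not mt_words:
        return 0.0
    return 100.0 * len(unknown) / len(mt_words)
```

You would call load_words() once for the MT output file and once for the
word list, then pass both results to find_unknown_words(). Note that
splitting at whitespace leaves punctuation attached, so "word," will not
match "word" unless you strip punctuation first.
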
There are several ways that I can think of where you could speed this up,
and surely a good many more that I haven't seen. For instance, once you
get something like that going, you might think about breaking the standard
word list up into sub-lists, one for words that start with 'a', etc. (This
would reduce how many comparisons you have to make for each word.) You
might also look to serialize (or store) those canonical word lists to save
the step of constructing them each time. But, once you've got it done as I
outline above, you should be well on your way to knowing how to improve it
in these or other ways.
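
To make those two ideas a little more concrete, here is one possible
sketch, again assuming Python 3. The function names are hypothetical, and
using the standard pickle module for the storing step is just one choice
among several.

```python
import pickle
from collections import defaultdict


def build_index(reference_words):
    # Group the reference words into sub-lists keyed by first letter,
    # so each lookup only scans words that start the same way.
    index = defaultdict(list)
    for word in reference_words:
        if word:
            index[word[0]].append(word)
    return dict(index)


def is_known(word, index):
    # Search only the sub-list for this word's first letter.
    return bool(word) and word in index.get(word[0], [])


def save_index(index, path):
    # Store the index on disk so it need not be rebuilt on every run.
    with open(path, "wb") as f:
        pickle.dump(index, f)


def load_index(path):
    # Read back an index previously written by save_index().
    with open(path, "rb") as f:
        return pickle.load(f)
```

In current Python a built-in set would likely give a similar speed-up with
less code, though that is a different technique from the sub-lists
described above.
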
Doc pages that you will find helpful -- these are also in your Python
installation on (many?/all?) platforms:
http://www.python.org/doc/2.3.4/lib/string-methods.html
http://www.python.org/doc/2.3.4/lib/typesseq-mutable.html
http://www.python.org/doc/2.3.4/lib/built-in-funcs.html
These will cover various methods that you will find useful. I'd suggest
looking through them briefly so you get a general 'lay of the land' and
then consulting in detail if/as the need arises.
Above all, though: remember this advice is from a relative newcomer!
Good luck and best,
Brian vdB