<div dir="ltr"><br><div class="gmail_extra"><br clear="all"><div><div dir="ltr"><div><font face="courier new, monospace"><br></font></div><font face="courier new, monospace"><br></font><div><font face="courier new, monospace">I guess this may help you</font></div>
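<div><font face="courier new, monospace">--------------------------</font></div><div><font face="courier new, monospace"><br></font></div><div><font face="courier new, monospace"><div>
from string import punctuation as punc</div><div>from string import whitespace as space</div><div><br></div><div><br></div><div>class TextProcessing(object):</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;"""Count the words in a text file and report the most frequent ones."""</div><div><br></div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;def __init__(self):</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.sorted_list = []  # (word, count) pairs, most frequent first</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.words_and_occurence = {}  # word -> number of occurrences</div><div><br></div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;def __sort_dict_by_value(self):</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"""Rebuild sorted_list from the counts, highest count first."""</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;counts = self.words_and_occurence.items()</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.sorted_list = sorted(counts, key=lambda x: x[1], reverse=True)</div><div><br></div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;def __count_word(self, word):</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"""Add one occurrence of word to the tally."""</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.words_and_occurence[word] = self.words_and_occurence.get(word, 0) + 1</div><div><br></div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;def __parse_file(self, file_name):</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"""Read the file line by line and count every non-empty word."""</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;with open(file_name, 'r') as fp:  # closes the file even on errors</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for line in fp:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for word in line.split():</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;word = word.strip(punc + space)</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if word:  # skip tokens that were pure punctuation</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.__count_word(word)</div><div><br></div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;def parse_file(self, file_name=None):</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"""Validate the file name, then count and rank its words."""</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if file_name is None:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;raise ValueError("Please pass the file to be parsed")</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if not file_name.endswith(".txt"):</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;raise ValueError("*** Error *** Not a valid text file")</div><div><br></div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.__parse_file(file_name)</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.__sort_dict_by_value()</div><div><br></div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;def print_top_n(self, n):</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"""Print the n most frequent words (fewer if the file has fewer)."""</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;top = [word for word, _ in self.sorted_list[:n]]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print("Top {0} words: {1}".format(n, top))</div><div><br></div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;def print_unique_words(self):</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"""Print every distinct word, most frequent first."""</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;words = [word for word, _ in self.sorted_list]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print("Unique words: {0}".format(words))</div><div><br></div><div><br></div>
<div>if __name__ == "__main__":</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;obj = TextProcessing()</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;obj.parse_file('test_input.txt')</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;obj.print_top_n(4)</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;obj.print_unique_words()</div><div><br></div></font></div><div><br></div><div><font face="courier new, monospace"><b>-- Regards --</b></font></div>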
<div><font face="courier new, monospace"><b><br></b></font></div><div><font face="courier new, monospace"><b> Siva Cn</b></font></div><div><font face="courier new, monospace" size="1"><b>Python Developer</b></font></div>
<div><font face="courier new, monospace"><b><br></b></font></div><div><font face="courier new, monospace" size="1"><b>+91 9620339598</b></font></div><div><font face="courier new, monospace" size="1"><b><a href="http://www.cnsiva.com" target="_blank">http://www.cnsiva.com</a></b></font></div>
<div><font face="courier new, monospace">---------------------</font><br></div></div></div>
<br><br><div class="gmail_quote">On Thu, Oct 17, 2013 at 7:58 PM, <span dir="ltr"><<a href="mailto:tutor-request@python.org" target="_blank">tutor-request@python.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Today's Topics:<br>
<br>
1. Re: Help please (Alan Gauld)<br>
2. Re: Help please (Peter Otten)<br>
3. Re: Help please (Dominik George)<br>
4. Re: Help please (Kengesbayev, Askar)<br>
<br>
<br>
----------------------------------------------------------------------<br>
<br>
Message: 1<br>
Date: Thu, 17 Oct 2013 14:13:07 +0100<br>
From: Alan Gauld <<a href="mailto:alan.gauld@btinternet.com">alan.gauld@btinternet.com</a>><br>
To: <a href="mailto:tutor@python.org">tutor@python.org</a><br>
Subject: Re: [Tutor] Help please<br>
Message-ID: <l3onop$oin$<a href="mailto:1@ger.gmane.org">1@ger.gmane.org</a>><br>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed<br>
<br>
On 16/10/13 19:49, Pinedo, Ruben A wrote:<br>
> I was given this code and I need to modify it so that it will:<br>
><br>
> #1. Error handling for the files to ensure reading only .txt file<br>
<br>
I'm not sure what is meant here, since your code only ever opens<br>
'emma.txt', which is presumably a text file. Or are you supposed<br>
to make the filename a user-provided value (perhaps using<br>
raw_input)?<br>
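<br>
A minimal sketch of one reading of requirement #1, assuming the file name comes from the caller and that "only .txt" refers to the extension. The helper name open_text_file and the sample file name are illustrative, not part of the original code; it is written for Python 3.<br>

```python
def open_text_file(filename):
    """Open filename for reading, insisting on a .txt extension."""
    if not filename.lower().endswith(".txt"):
        raise ValueError("Not a .txt file: {0}".format(filename))
    return open(filename)


try:
    # Hypothetical file name; the extension check fails before any I/O.
    with open_text_file("notes.pdf") as fp:
        pass
except ValueError as err:
    print(err)
except IOError as err:
    # Raised when the file is missing or unreadable.
    print("Unable to open file:", err)
```

Raising ValueError for a bad name and leaving IOError for a failed open keeps the two failure modes distinguishable for the caller.<br>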
<br>
> #2. Print a range of top words... ex: print top 10-20 words<br>
<br>
I assume 'top' here means the most common? Whoever is writing the<br>
specification for this problem needs to be a bit more specific<br>
in their definitions.<br>
<br>
If so you need to fix the bugs in process_line() and<br>
process_file(). I don't know if these are deliberate bugs<br>
or somebody is just sloppy. But neither works as expected<br>
right now. (Hint: Consider the return values of each)<br>
<br>
Once you've done that you can figure out how to extract<br>
the required number of words from your (unsorted) dictionary,<br>
put that in a reporting function and print the output.<br>
You might be able to use the two common words functions,<br>
although watch out because they don't do exactly what<br>
you want and one of them is basically broken...<br>
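<br>
A minimal sketch of such a reporting helper, assuming "top" means most frequent and that the range is 1-based and inclusive. The name top_words and the sample histogram are hypothetical; it is written for Python 3.<br>

```python
def top_words(hist, start=1, stop=10):
    """Return (word, count) pairs ranked start..stop (1-based, inclusive)."""
    # Sort by descending count, breaking ties alphabetically so the
    # ranking is the same on every run.
    ranked = sorted(hist.items(), key=lambda item: (-item[1], item[0]))
    return ranked[start - 1:stop]


# Tiny made-up histogram standing in for the real book data.
sample = {"the": 5, "emma": 3, "and": 4, "of": 2}
for word, count in top_words(sample, 2, 3):
    print(count, word)
```

Returning the pairs instead of printing inside the helper keeps the data extraction separate from the output.<br>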
<br>
> #3. Print only the words with > 3 characters<br>
<br>
Modify the above to discard words of 3 letters or less.<br>
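<br>
One possible way to do that filtering, sketched under the assumption that the histogram maps words to counts. The name long_words and its min_length parameter are illustrative only; it is written for Python 3.<br>

```python
def long_words(hist, min_length=4):
    """Return a new dict with only the words of at least min_length letters."""
    return {word: count for word, count in hist.items()
            if len(word) >= min_length}


# Made-up sample data: "the" and "and" have 3 letters and are dropped.
sample = {"the": 5, "emma": 3, "and": 4, "woodhouse": 1}
print(long_words(sample))
```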
<br>
> #4. Modify the printing function to print top 1 or 2 or 3 ....<br>
<br>
I assume this means take a parameter that specifies the<br>
number of words to print. Or it could be the length of<br>
word to ignore. Again the specification is woolly.<br>
In either case it's a small modification to your<br>
reporting function.<br>
<br>
> #5. How many unique words are there in the book of length 1, 2, 3 etc<br>
<br>
This is slicing the data slightly differently but<br>
again not that different to the earlier requirement.<br>
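<br>
A short sketch of that slicing, assuming each key of the histogram is one unique word. The function name unique_words_by_length and the sample data are hypothetical; it is written for Python 3.<br>

```python
from collections import Counter


def unique_words_by_length(hist):
    """Map each word length to the number of distinct words of that length."""
    # Iterating over a dict yields its keys, i.e. the unique words.
    return Counter(len(word) for word in hist)


# Made-up sample data.
sample = {"a": 7, "i": 2, "of": 4, "emma": 3}
counts = unique_words_by_length(sample)
for length in sorted(counts):
    print(length, counts[length])
```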
<br>
> I am fairly new to Python and am completely lost. I looked in my book as<br>
> to how to do number one but I cannot figure out what to modify and/or<br>
> delete to add the print selection. This is the code:<br>
<br>
You need to modify the two broken functions and add a<br>
new reporting function. (Despite the reference to a<br>
printing function I'd suggest keeping the data extraction<br>
and printing separate.)<br>
<br>
> import string<br>
><br>
> def process_file(filename):<br>
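> &nbsp;&nbsp;&nbsp;&nbsp;hist = dict()<br>
> &nbsp;&nbsp;&nbsp;&nbsp;fp = open(filename)<br>
> &nbsp;&nbsp;&nbsp;&nbsp;for line in fp:<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;process_line(line, hist)<br>
> &nbsp;&nbsp;&nbsp;&nbsp;return hist<br>
><br>
> def process_line(line, hist):<br>
> &nbsp;&nbsp;&nbsp;&nbsp;line = line.replace('-', ' ')<br>
> &nbsp;&nbsp;&nbsp;&nbsp;for word in line.split():<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;word = word.strip(string.punctuation + string.whitespace)<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;word = word.lower()<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;hist[word] = hist.get(word, 0) + 1<br>
><br>
> def common_words(hist):<br>
> &nbsp;&nbsp;&nbsp;&nbsp;t = []<br>
> &nbsp;&nbsp;&nbsp;&nbsp;for key, value in hist.items():<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;t.append((value, key))<br>
> &nbsp;&nbsp;&nbsp;&nbsp;t.sort(reverse=True)<br>
> &nbsp;&nbsp;&nbsp;&nbsp;return t<br>
><br>
> def most_common_words(hist, num=100):<br>
> &nbsp;&nbsp;&nbsp;&nbsp;t = common_words(hist)<br>
> &nbsp;&nbsp;&nbsp;&nbsp;print 'The most common words are:'<br>
> &nbsp;&nbsp;&nbsp;&nbsp;for freq, word in t[:num]:<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print freq, '\t', word<br>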
> hist = process_file('emma.txt')<br>
> print 'Total num of Words:', sum(hist.values())<br>
> print 'Total num of Unique Words:', len(hist)<br>
> most_common_words(hist, 50)<br>
><br>
> Any help would be greatly appreciated because I am struggling in this<br>
> class. Thank you in advance.<br>
<br>
hth<br>
--<br>
Alan G<br>
Author of the Learn to Program web site<br>
<a href="http://www.alan-g.me.uk/" target="_blank">http://www.alan-g.me.uk/</a><br>
<a href="http://www.flickr.com/photos/alangauldphotos" target="_blank">http://www.flickr.com/photos/alangauldphotos</a><br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Thu, 17 Oct 2013 15:37:49 +0200<br>
From: Peter Otten <__<a href="mailto:peter__@web.de">peter__@web.de</a>><br>
To: <a href="mailto:tutor@python.org">tutor@python.org</a><br>
Subject: Re: [Tutor] Help please<br>
Message-ID: <l3op59$8n6$<a href="mailto:1@ger.gmane.org">1@ger.gmane.org</a>><br>
Content-Type: text/plain; charset="ISO-8859-1"<br>
<br>
Alan Gauld wrote:<br>
<br>
[Ruben Pinedo]<br>
<br>
> def process_file(filename):<br>
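> &nbsp;&nbsp;&nbsp;&nbsp;hist = dict()<br>
> &nbsp;&nbsp;&nbsp;&nbsp;fp = open(filename)<br>
> &nbsp;&nbsp;&nbsp;&nbsp;for line in fp:<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;process_line(line, hist)<br>
> &nbsp;&nbsp;&nbsp;&nbsp;return hist<br>
><br>
> def process_line(line, hist):<br>
> &nbsp;&nbsp;&nbsp;&nbsp;line = line.replace('-', ' ')<br>
><br>
> &nbsp;&nbsp;&nbsp;&nbsp;for word in line.split():<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;word = word.strip(string.punctuation + string.whitespace)<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;word = word.lower()<br>
><br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;hist[word] = hist.get(word, 0) + 1<br>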
<br>
[Alan Gauld]<br>
<br>
> If so you need to fix the bugs in process_line() and<br>
> process_file(). I don't know if these are deliberate bugs<br>
> or somebody is just sloppy. But neither works as expected<br>
> right now. (Hint: Consider the return values of each)<br>
<br>
I fail to see the bug.<br>
<br>
process_line() mutates its `hist` argument, so there's no need to return<br>
something. Or did you mean something else that escapes me?<br>
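<br>
The point about mutation can be illustrated with a simplified stand-in for process_line() (no punctuation stripping or lowercasing); this is a sketch for Python 3, not the original function.<br>

```python
def process_line(line, hist):
    """Update hist in place; there is no return statement."""
    for word in line.split():
        hist[word] = hist.get(word, 0) + 1


hist = {}
result = process_line("to be or not to be", hist)
print(result)  # the function returns None
print(hist)    # yet the caller's dictionary now holds the counts
```

Because the dictionary is passed by reference, the caller sees the updated counts without needing a return value.<br>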
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 3<br>
Date: Thu, 17 Oct 2013 16:17:27 +0200<br>
From: Dominik George <<a href="mailto:nik@naturalnet.de">nik@naturalnet.de</a>><br>
To: Todd Matsumoto <<a href="mailto:c.t.matsumoto@gmail.com">c.t.matsumoto@gmail.com</a>>,<a href="mailto:tutor@python.org">tutor@python.org</a><br>
Subject: Re: [Tutor] Help please<br>
Message-ID: <<a href="mailto:f310f0be-858d-48e2-ae88-5ad720518888@email.android.com">f310f0be-858d-48e2-ae88-5ad720518888@email.android.com</a>><br>
Content-Type: text/plain; charset=UTF-8<br>
<br>
Todd Matsumoto <<a href="mailto:c.t.matsumoto@gmail.com">c.t.matsumoto@gmail.com</a>> schrieb:<br>
>> #1. Error handling for the files to ensure reading only .txt file<br>
>Look up exceptions.<br>
>Find out what the string method endswith() does.<br>
<br>
One should note that the OP probably meant files of the type text/plain rather than .txt files. File name extensions are a convenience for identifying a file at first glance, but they tell you absolutely nothing about the contents.<br>
<br>
So, look up MIME types as well ;)!<br>
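<br>
A rough sketch of checking the contents rather than the name, written for Python 3. Note that the standard mimetypes.guess_type() also works from the file name, so this example instead uses a simple heuristic that is purely an assumption here: a file counts as text if its first bytes contain no NUL byte and decode as UTF-8. The name looks_like_text and the sample size are illustrative.<br>

```python
import codecs


def looks_like_text(path, sample_size=1024):
    """Heuristic check: does the start of the file look like UTF-8 text?"""
    with open(path, "rb") as fp:
        sample = fp.read(sample_size)
    if b"\x00" in sample:
        return False  # NUL bytes are typical of binary formats
    # An incremental decoder tolerates a multi-byte character that is
    # cut off at the end of the sample, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample)
    except UnicodeDecodeError:
        return False
    return True
```

This will misjudge text in other encodings (for example Latin-1 with accented characters), so treat it as a starting point only.<br>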
<br>
-nik<br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 4<br>
Date: Thu, 17 Oct 2013 14:21:17 +0000<br>
From: "Kengesbayev, Askar" <<a href="mailto:askar.kengesbayev@etrade.com">askar.kengesbayev@etrade.com</a>><br>
To: "Pinedo, Ruben A" <<a href="mailto:rapinedo@miners.utep.edu">rapinedo@miners.utep.edu</a>>, "<a href="mailto:tutor@python.org">tutor@python.org</a>"<br>
<<a href="mailto:tutor@python.org">tutor@python.org</a>><br>
Subject: Re: [Tutor] Help please<br>
Message-ID:<br>
<<a href="mailto:6FAD14604B087B438F6FF64D9875A40C68F5ADCA@atl1ex10mbx4.corp.etradegrp.com">6FAD14604B087B438F6FF64D9875A40C68F5ADCA@atl1ex10mbx4.corp.etradegrp.com</a>><br>
<br>
Content-Type: text/plain; charset="us-ascii"<br>
<br>
Ruben,<br>
<br>
#1 you can try something like this<br>
try:<br>
&nbsp;&nbsp;&nbsp;&nbsp;with open('my_file.txt') as file:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pass<br>
except IOError as e:<br>
&nbsp;&nbsp;&nbsp;&nbsp;print "Unable to open file" #Does not exist or you do not have read permission<br>
<br>
#2. I would use a regular expression to push the words into an array (a list), which you can then manipulate. I am not sure it is the most efficient way, but it should work.<br>
#3. An easy way would be to use a regular expression (the re module).<br>
#4. Once you have the array from #2 you can sort it and print whatever top words you need.<br>
#5. I am not sure of the best way to do this, but you can experiment with the array from #2.<br>
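<br>
The regular-expression idea in #2 and #3 might look like the sketch below, written for Python 3. The pattern and the name words_from_text are assumptions chosen for illustration; the pattern treats runs of ASCII letters and apostrophes as words.<br>

```python
import re


def words_from_text(text):
    """Return the lowercase words in text, in order of appearance."""
    # [a-z']+ keeps simple contractions such as "don't" together.
    return re.findall(r"[a-z']+", text.lower())


words = words_from_text("Emma Woodhouse, handsome, clever, and rich")
# Requirement #3: keep only the words with more than 3 characters.
long_words = [w for w in words if len(w) > 3]
print(long_words)
```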
<br>
Thanks,<br>
Askar<br>
<br>
From: Pinedo, Ruben A [mailto:<a href="mailto:rapinedo@miners.utep.edu">rapinedo@miners.utep.edu</a>]<br>
Sent: Wednesday, October 16, 2013 2:49 PM<br>
To: <a href="mailto:tutor@python.org">tutor@python.org</a><br>
Subject: [Tutor] Help please<br>
<br>
I was given this code and I need to modify it so that it will:<br>
<br>
#1. Error handling for the files to ensure reading only .txt file<br>
#2. Print a range of top words... ex: print top 10-20 words<br>
#3. Print only the words with > 3 characters<br>
#4. Modify the printing function to print top 1 or 2 or 3 ....<br>
#5. How many unique words are there in the book of length 1, 2, 3 etc<br>
<br>
I am fairly new to Python and am completely lost. I looked in my book as to how to do number one but I cannot figure out what to modify and/or delete to add the print selection. This is the code:<br>
<br>
<br>
import string<br>
<br>
def process_file(filename):<br>
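&nbsp;&nbsp;&nbsp;&nbsp;hist = dict()<br>
&nbsp;&nbsp;&nbsp;&nbsp;fp = open(filename)<br>
&nbsp;&nbsp;&nbsp;&nbsp;for line in fp:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;process_line(line, hist)<br>
&nbsp;&nbsp;&nbsp;&nbsp;return hist<br>
<br>
def process_line(line, hist):<br>
&nbsp;&nbsp;&nbsp;&nbsp;line = line.replace('-', ' ')<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;for word in line.split():<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;word = word.strip(string.punctuation + string.whitespace)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;word = word.lower()<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;hist[word] = hist.get(word, 0) + 1<br>
<br>
def common_words(hist):<br>
&nbsp;&nbsp;&nbsp;&nbsp;t = []<br>
&nbsp;&nbsp;&nbsp;&nbsp;for key, value in hist.items():<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;t.append((value, key))<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;t.sort(reverse=True)<br>
&nbsp;&nbsp;&nbsp;&nbsp;return t<br>
<br>
def most_common_words(hist, num=100):<br>
&nbsp;&nbsp;&nbsp;&nbsp;t = common_words(hist)<br>
&nbsp;&nbsp;&nbsp;&nbsp;print 'The most common words are:'<br>
&nbsp;&nbsp;&nbsp;&nbsp;for freq, word in t[:num]:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print freq, '\t', word<br>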
<br>
hist = process_file('emma.txt')<br>
print 'Total num of Words:', sum(hist.values())<br>
print 'Total num of Unique Words:', len(hist)<br>
most_common_words(hist, 50)<br>
<br>
Any help would be greatly appreciated because I am struggling in this class. Thank you in advance.<br>
<br>
Respectfully,<br>
<br>
Ruben Pinedo<br>
Computer Information Systems<br>
College of Business Administration<br>
University of Texas at El Paso<br>
------------------------------<br>
<br>
End of Tutor Digest, Vol 116, Issue 37<br>
**************************************<br>
</blockquote></div><br></div></div>