[Tutor] Tutor Digest, Vol 116, Issue 37
Siva Cn
cnsiva.in at gmail.com
Thu Oct 17 17:49:49 CEST 2013
I guess this may help you
--------------------------
import operator
from string import whitespace as space
from string import punctuation as punc


class TextProcessing(object):
    """Count word occurrences in a .txt file and report on them."""

    def __init__(self):
        self.file = None
        self.sorted_list = []
        self.words_and_occurence = {}

    def __sort_dict_by_value(self):
        """Store (word, count) pairs in sorted_list, most frequent first."""
        sorted_in_rev = sorted(self.words_and_occurence.items(),
                               key=operator.itemgetter(1))
        self.sorted_list = sorted_in_rev[::-1]

    def __validate_words(self, word):
        """Increment the count for word, adding it on first sight."""
        if word in self.words_and_occurence:
            self.words_and_occurence[word] += 1
        else:
            self.words_and_occurence[word] = 1

    def __parse_file(self, file_name):
        """Read the file line by line and count every non-empty word."""
        with open(file_name, 'r') as fp:
            for line in fp:
                for word in line.split():
                    # Strip surrounding punctuation; skip tokens that
                    # consisted of punctuation only.
                    word = word.strip(punc + space)
                    if word:
                        self.__validate_words(word)

    def parse_file(self, file_name=None):
        """Check the file name, then count and sort the words."""
        if file_name is None:
            raise Exception("Please pass the file to be parsed")
        if not file_name.endswith(".txt"):
            raise Exception("*** Error *** Not a valid text file")
        self.__parse_file(file_name)
        self.__sort_dict_by_value()

    def print_top_n(self, n):
        """Print the n most frequent words (fewer if there are fewer)."""
        # Slicing avoids an IndexError when n exceeds the number of words.
        top = [word for word, _count in self.sorted_list[:n]]
        print "Top {0} words:".format(n), top

    def print_unique_words(self):
        """Print every distinct word, most frequent first."""
        print "Unique words:", [word for word, _count in self.sorted_list]


if __name__ == "__main__":
    obj = TextProcessing()
    obj.parse_file(r'test_input.txt')
    obj.print_top_n(4)
    obj.print_unique_words()
-- Regards --

Siva Cn
Python Developer

+91 9620339598
http://www.cnsiva.com
---------------------
On Thu, Oct 17, 2013 at 7:58 PM, <tutor-request at python.org> wrote:
>
> Today's Topics:
>
> 1. Re: Help please (Alan Gauld)
> 2. Re: Help please (Peter Otten)
> 3. Re: Help please (Dominik George)
> 4. Re: Help please (Kengesbayev, Askar)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 17 Oct 2013 14:13:07 +0100
> From: Alan Gauld <alan.gauld at btinternet.com>
> To: tutor at python.org
> Subject: Re: [Tutor] Help please
> Message-ID: <l3onop$oin$1 at ger.gmane.org>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 16/10/13 19:49, Pinedo, Ruben A wrote:
> > I was given this code and I need to modify it so that it will:
> >
> > #1. Error handling for the files to ensure reading only .txt file
>
> I'm not sure what is meant here, since your code only ever opens
> 'emma.txt', which is presumably a text file... Or are you
> supposed to make the filename a user-provided value
> (using raw_input, maybe)?
>
> > #2. Print a range of top words... ex: print top 10-20 words
>
> I assume 'top' here means the most common? Whoever is writing the
> specification for this problem needs to be a bit more specific
> in their definitions.
>
> If so you need to fix the bugs in process_line() and
> process_file(). I don't know if these are deliberate bugs
> or somebody is just sloppy. But neither works as expected
> right now. (Hint: Consider the return values of each)
>
> Once you've done that you can figure out how to extract
> the required number of words from your (unsorted) dictionary,
> put that in a reporting function and print the output.
> You might be able to use the two common words functions,
> although watch out because they don't do exactly what
> you want and one of them is basically broken...
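One way to read requirement #2 is "rank the words by count, then slice out a range of ranks". The following is only a minimal sketch under that assumption, written for Python 3 (the thread's code is Python 2). The name top_words and its start/stop parameters are hypothetical and not part of the original program; hist is the word-to-count dictionary that process_file() returns.

```python
def top_words(hist, start=0, stop=10):
    """Return (word, count) pairs for ranks start..stop-1, most common first."""
    # Sort by descending count; ties are broken alphabetically so that the
    # result is the same on every run.
    ranked = sorted(hist.items(), key=lambda item: (-item[1], item[0]))
    return ranked[start:stop]


# Small hand-made histogram standing in for process_file('emma.txt').
hist = {"the": 5, "emma": 3, "and": 4, "of": 4, "a": 1}
for word, count in top_words(hist, 1, 3):
    print(count, word, sep="\t")
```

Because the ranks are zero-based slice positions, "top 10-20" read as the 10th through 20th most common words would be top_words(hist, 9, 20). Whether the assignment means it that way is a guess.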
>
> > #3. Print only the words with > 3 characters
>
> Modify the above to discard words of 3 letters or less.
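Discarding short words can be done as a separate filtering step before any reporting. This is a sketch only, in Python 3; longer_than and min_chars are made-up names, and hist is assumed to be the word-to-count dictionary from process_file().

```python
def longer_than(hist, min_chars=3):
    """Return a new histogram with only the words longer than min_chars."""
    # A new dict is built so the original histogram stays untouched.
    return {word: count for word, count in hist.items() if len(word) > min_chars}


hist = {"the": 5, "emma": 3, "a": 1, "woodhouse": 2}
print(longer_than(hist))
```

Keeping the filter separate from the printing means the same reporting function can be reused on the filtered histogram.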
>
> > #4. Modify the printing function to print top 1 or 2 or 3 ....
>
> I assume this means take a parameter that specifies the
> number of words to print. Or it could be the length of
> word to ignore. Again the specification is woolly.
> In either case it's a small modification to your
> reporting function.
>
> > #5. How many unique words are there in the book of length 1, 2, 3 etc
>
> This is slicing the data slightly differently but
> again not that different to the earlier requirement.
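For requirement #5, the keys of the histogram are already the unique words, so it may be enough to group them by length. The sketch below (Python 3) assumes that reading; unique_words_by_length is a hypothetical name. It also skips an empty-string key, which process_line() can produce when a token consists of punctuation only.

```python
from collections import Counter


def unique_words_by_length(hist):
    """Map word length -> number of distinct words of that length."""
    # Iterating over a dict yields its keys, i.e. each unique word once.
    return Counter(len(word) for word in hist if word)


hist = {"a": 1, "i": 2, "of": 4, "the": 5, "and": 4, "": 3}
for length, n_words in sorted(unique_words_by_length(hist).items()):
    print("length", length, ":", n_words, "unique words")
```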
>
> > I am fairly new to Python and am completely lost. I looked in my book as
> > to how to do number one, but I cannot figure out what to modify and/or
> > delete to add the print selection. This is the code:
>
> You need to modify the two broken functions and add a
> new reporting function. (Despite the reference to a
> printing function, I'd suggest keeping the data extraction
> and printing separate.)
>
> > import string
> >
> > def process_file(filename):
> >     hist = dict()
> >     fp = open(filename)
> >     for line in fp:
> >         process_line(line, hist)
> >     return hist
> >
> > def process_line(line, hist):
> >     line = line.replace('-', ' ')
> >     for word in line.split():
> >         word = word.strip(string.punctuation + string.whitespace)
> >         word = word.lower()
> >         hist[word] = hist.get(word, 0) + 1
> >
> > def common_words(hist):
> >     t = []
> >     for key, value in hist.items():
> >         t.append((value, key))
> >     t.sort(reverse=True)
> >     return t
> >
> > def most_common_words(hist, num=100):
> >     t = common_words(hist)
> >     print 'The most common words are:'
> >     for freq, word in t[:num]:
> >         print freq, '\t', word
> >
> > hist = process_file('emma.txt')
> > print 'Total num of Words:', sum(hist.values())
> > print 'Total num of Unique Words:', len(hist)
> > most_common_words(hist, 50)
> >
> > Any help would be greatly appreciated because I am struggling in this
> > class. Thank you in advance.
>
> hth
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.flickr.com/photos/alangauldphotos
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 17 Oct 2013 15:37:49 +0200
> From: Peter Otten <__peter__ at web.de>
> To: tutor at python.org
> Subject: Re: [Tutor] Help please
> Message-ID: <l3op59$8n6$1 at ger.gmane.org>
> Content-Type: text/plain; charset="ISO-8859-1"
>
> Alan Gauld wrote:
>
> [Ruben Pinedo]
>
> > def process_file(filename):
> >     hist = dict()
> >     fp = open(filename)
> >     for line in fp:
> >         process_line(line, hist)
> >     return hist
> >
> > def process_line(line, hist):
> >     line = line.replace('-', ' ')
> >
> >     for word in line.split():
> >         word = word.strip(string.punctuation + string.whitespace)
> >         word = word.lower()
> >
> >         hist[word] = hist.get(word, 0) + 1
>
> [Alan Gauld]
>
> > If so you need to fix the bugs in process_line() and
> > process_file(). I don't know if these are deliberate bugs
> > or somebody is just sloppy. But neither works as expected
> > right now. (Hint: Consider the return values of each)
>
> I fail to see the bug.
>
> process_line() mutates its `hist` argument, so there's no need to return
> something. Or did you mean something else that escapes me?
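The point about mutation can be checked with a tiny experiment. This is an illustrative sketch in Python 3, not the original code: add_word is a made-up, one-word version of what process_line() does to its hist argument.

```python
def add_word(hist, word):
    """Increment the count for word in place; nothing is returned."""
    hist[word] = hist.get(word, 0) + 1


hist = {}
result = add_word(hist, "emma")
add_word(hist, "emma")

print(result)  # None: the function has no return statement
print(hist)    # {'emma': 2}: the caller's dict was changed in place
```

The dictionary is passed by reference, so the caller sees the updated counts even though the function returns None.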
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 17 Oct 2013 16:17:27 +0200
> From: Dominik George <nik at naturalnet.de>
> To: Todd Matsumoto <c.t.matsumoto at gmail.com>,tutor at python.org
> Subject: Re: [Tutor] Help please
> Message-ID: <f310f0be-858d-48e2-ae88-5ad720518888 at email.android.com>
> Content-Type: text/plain; charset=UTF-8
>
> Todd Matsumoto <c.t.matsumoto at gmail.com> schrieb:
> >> #1. Error handling for the files to ensure reading only .txt file
> >Look up exceptions.
> >Find out what the string method endswith() does.
>
> One should note that the OP probably meant files of the type text/plain
> rather than .txt files. File name extensions are a convenience to identify
> a file on first glance, but they tell absolutely nothing about the contents.
>
> So, look up MIME types as well ;)!
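Note that the standard mimetypes.guess_type() also works from the file name alone, so it does not look at the contents either. If the goal really is "plain text, whatever the name", one rough heuristic is to read a sample of bytes and see whether it decodes. The sketch below (Python 3) assumes UTF-8 and treats NUL bytes as a sign of binary data; looks_like_text and its parameters are hypothetical names, and the check can be fooled in both directions.

```python
import codecs


def looks_like_text(path, sample_size=4096, encoding="utf-8"):
    """Heuristically decide whether the start of a file is decodable text."""
    with open(path, "rb") as fp:
        sample = fp.read(sample_size)
    # NUL bytes are rare in plain text but common in binary formats.
    if b"\x00" in sample:
        return False
    # An incremental decoder with final=False tolerates a multi-byte
    # character that was cut off at the end of the sample.
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

A caller could combine this with the endswith('.txt') check and reject the file if either test fails.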
>
> - -nik
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 17 Oct 2013 14:21:17 +0000
> From: "Kengesbayev, Askar" <askar.kengesbayev at etrade.com>
> To: "Pinedo, Ruben A" <rapinedo at miners.utep.edu>, "tutor at python.org"
> <tutor at python.org>
> Subject: Re: [Tutor] Help please
> Message-ID: <6FAD14604B087B438F6FF64D9875A40C68F5ADCA at atl1ex10mbx4.corp.etradegrp.com>
>
> Content-Type: text/plain; charset="us-ascii"
>
> Ruben,
>
> #1. You can try something like this:
> try:
>     with open('my_file.txt') as file:
>         pass
> except IOError as e:
>     # The file does not exist or you do not have read permission
>     print "Unable to open file"
>
> #2. I would use a regular expression to push the words into an array, which
> you can then manipulate. I am not sure it is the most efficient way, but it
> should work.
> #3. An easy way would be to use a regular expression (the re module).
> #4. Once you have the array from #2, you can sort it and print whatever top
> words you need.
> #5. I am not sure of the best way to do this, but you can work with the
> array from #2.
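The regular-expression route suggested above might look roughly like this. It is a sketch in Python 3 under stated assumptions: the pattern matches runs of ASCII letters only, so "don't" would be split into "don" and "t", and count_words, WORD_RE and min_chars are hypothetical names rather than anything from the original program.

```python
import re
from collections import Counter

# Assumed pattern: a "word" is one or more ASCII letters.
WORD_RE = re.compile(r"[a-z]+")


def count_words(text, min_chars=1):
    """Count the words in text that have at least min_chars letters."""
    words = WORD_RE.findall(text.lower())
    return Counter(word for word in words if len(word) >= min_chars)


text = "Emma Woodhouse, handsome, clever, and rich; Emma was happy."
counts = count_words(text, min_chars=4)   # words with more than 3 characters
print(counts.most_common(1))              # [('emma', 2)]
print(len(counts), "unique words")
```

Counter.most_common(n) covers the sorting step from #4, and len(counts) gives the number of unique words after filtering.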
>
> Thanks,
> Askar
>
> From: Pinedo, Ruben A [mailto:rapinedo at miners.utep.edu]
> Sent: Wednesday, October 16, 2013 2:49 PM
> To: tutor at python.org
> Subject: [Tutor] Help please
>
> I was given this code and I need to modify it so that it will:
>
> #1. Error handling for the files to ensure reading only .txt file
> #2. Print a range of top words... ex: print top 10-20 words
> #3. Print only the words with > 3 characters
> #4. Modify the printing function to print top 1 or 2 or 3 ....
> #5. How many unique words are there in the book of length 1, 2, 3 etc
>
> I am fairly new to Python and am completely lost. I looked in my book as
> to how to do number one, but I cannot figure out what to modify and/or
> delete to add the print selection. This is the code:
>
>
> import string
>
> def process_file(filename):
>     hist = dict()
>     fp = open(filename)
>     for line in fp:
>         process_line(line, hist)
>     return hist
>
> def process_line(line, hist):
>     line = line.replace('-', ' ')
>
>     for word in line.split():
>         word = word.strip(string.punctuation + string.whitespace)
>         word = word.lower()
>
>         hist[word] = hist.get(word, 0) + 1
>
> def common_words(hist):
>     t = []
>     for key, value in hist.items():
>         t.append((value, key))
>
>     t.sort(reverse=True)
>     return t
>
> def most_common_words(hist, num=100):
>     t = common_words(hist)
>     print 'The most common words are:'
>     for freq, word in t[:num]:
>         print freq, '\t', word
>
> hist = process_file('emma.txt')
> print 'Total num of Words:', sum(hist.values())
> print 'Total num of Unique Words:', len(hist)
> most_common_words(hist, 50)
>
> Any help would be greatly appreciated because I am struggling in this
> class. Thank you in advance.
>
> Respectfully,
>
> Ruben Pinedo
> Computer Information Systems
> College of Business Administration
> University of Texas at El Paso