[Tutor] Tutor Digest, Vol 116, Issue 37

Siva Cn cnsiva.in at gmail.com
Thu Oct 17 17:49:49 CEST 2013


I guess this may help you
--------------------------


from string import whitespace as space
from string import punctuation as punc

class TextProcessing(object):
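    """Count word occurrences in a text file and report the most common."""
    def __init__(self):
        """Set up an empty word count and an empty sorted list."""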
        self.file = None
        self.sorted_list = []
        self.words_and_occurence = {}

    def __sort_dict_by_value(self):
        """Store (word, count) pairs in sorted_list, most frequent first."""
        self.sorted_list = sorted(self.words_and_occurence.items(),
                                  key=lambda x: x[1], reverse=True)

    def __validate_words(self, word):
        """Add one to the occurrence count of word."""
        if word in self.words_and_occurence:
            self.words_and_occurence[word] += 1
        else:
            self.words_and_occurence[word] = 1

    def __parse_file(self, file_name):
        """Read the file line by line and count every non-empty word."""
        with open(file_name, 'r') as fp:
            for line in fp:
                for word in line.split():
                    # Strip surrounding punctuation; skip tokens that were
                    # nothing but punctuation (e.g. "--").
                    word = word.strip(punc + space)
                    if word:
                        self.__validate_words(word)

    def parse_file(self, file_name=None):
        """Check that file_name is a .txt file, then count and sort its words."""
        if file_name is None:
            raise Exception("Please pass the file to be parsed")
        if not file_name.endswith(r".txt"):
            raise Exception("*** Error *** Not a valid text file")

        self.__parse_file(file_name)

        self.__sort_dict_by_value()

    def print_top_n(self, n):
        """Print the n most frequent words (fewer if the file has fewer)."""
        top_words = [word for word, count in self.sorted_list[:n]]
        print "Top {0} words:".format(n), top_words

    def print_unique_words(self):
        """Print every distinct word, most frequent first."""
        print "Unique words:", [word for word, count in self.sorted_list]

if __name__ == "__main__":
    # Example run; assumes test_input.txt exists in the current directory.
    obj = TextProcessing()
    obj.parse_file(r'test_input.txt')
    obj.print_top_n(4)
    obj.print_unique_words()
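
As a side note, the counting and sorting above could also be sketched with
collections.Counter from the standard library. This is only an illustration,
not a replacement for the class: the helper name count_words is made up here,
and the code is written so that it also runs on Python 3.

```python
from collections import Counter
from string import punctuation, whitespace


def count_words(lines):
    """Return a Counter mapping each word to its number of occurrences.

    `lines` is any iterable of strings (an open file works too).
    count_words is a hypothetical helper, not part of the class above.
    """
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = token.strip(punctuation + whitespace)
            if word:  # skip tokens that were only punctuation
                counts[word] += 1
    return counts


counts = count_words(["the cat and the hat", "the end."])
# most_common(n) returns the n most frequent (word, count) pairs
top_one = counts.most_common(1)
```

len(counts) then gives the number of unique words, so the sorting step is
only needed for the "top n" report.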





Regards,
Siva Cn
Python Developer
+91 9620339598
http://www.cnsiva.com


On Thu, Oct 17, 2013 at 7:58 PM, <tutor-request at python.org> wrote:

> Today's Topics:
>
>    1. Re: Help please (Alan Gauld)
>    2. Re: Help please (Peter Otten)
>    3. Re: Help please (Dominik George)
>    4. Re: Help please (Kengesbayev, Askar)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 17 Oct 2013 14:13:07 +0100
> From: Alan Gauld <alan.gauld at btinternet.com>
> To: tutor at python.org
> Subject: Re: [Tutor] Help please
> Message-ID: <l3onop$oin$1 at ger.gmane.org>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 16/10/13 19:49, Pinedo, Ruben A wrote:
> > I was given this code and I need to modify it so that it will:
> >
> > #1. Error handling for the files to ensure reading only .txt file
>
> I'm not sure what is meant here since your code only ever opens
> 'emma.txt', so it is presumably a text file... Or are you
> supposed to make the filename a user-provided value
> (using raw_input, maybe?)
>
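If the file name is meant to come from the user, requirement #1 could be
read as: check the extension, then handle a failed open. A minimal sketch
under that assumption; open_text_file is a made-up name and the messages are
placeholders. It is written to run on Python 3 as well (there IOError is an
alias of OSError).

```python
def open_text_file(filename):
    """Open filename for reading, or return None if it cannot be used.

    Hypothetical helper: only names ending in .txt are accepted, and a
    failed open is reported instead of raising.
    """
    if not filename.lower().endswith('.txt'):
        print('Not a .txt file: %s' % filename)
        return None
    try:
        return open(filename)
    except IOError as err:  # missing file, no read permission, ...
        print('Unable to open %s: %s' % (filename, err))
        return None
```

The caller then only has to test the result for None before reading from it.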
> > #2. Print a range of top words... ex: print top 10-20 words
>
> I assume 'top' here means the most common? Whoever is writing the
> specification for this problem needs to be a bit more specific
> in their definitions.
>
> If so you need to fix the bugs in process_line() and
> process_file(). I don't know if these are deliberate bugs
> or somebody is just sloppy. But neither works as expected
> right now. (Hint: Consider the return values of each)
>
> Once you've done that you can figure out how to extract
> the required number of words from your (unsorted) dictionary.
> and put that in a reporting function and print the output.
> You might be able to use the two common words functions,
> although watch out because they don't do exactly what
> you want and one of them is basically broken...
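Assuming 'top' does mean most frequent, one way to get a range such as the
10th to 20th most common words is to sort the histogram once and slice it.
A rough sketch; top_range and its 1-based, inclusive arguments are my own
convention, not part of the assignment.

```python
def top_range(hist, first, last):
    """Return (word, count) pairs ranked first..last, most frequent first.

    hist maps word -> count; ranks are 1-based and inclusive, so
    top_range(hist, 10, 20) gives the 10th to the 20th most common word.
    Ties are broken alphabetically to keep the result repeatable.
    """
    ranked = sorted(hist.items(), key=lambda item: (-item[1], item[0]))
    return ranked[first - 1:last]


hist = {'the': 5, 'emma': 3, 'and': 4, 'a': 4}
second_to_third = top_range(hist, 2, 3)
```

Asking for more ranks than there are words simply returns a shorter list,
because slicing never raises an IndexError.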
>
> > #3. Print only the words with > 3 characters
>
> Modify the above to discard words of 3 letters or less.
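For instance, the filter can be applied to the histogram before it is
sorted. A small sketch, with long_words as a hypothetical name:

```python
def long_words(hist, min_length=4):
    """Return a new dict with only the words of at least min_length characters.

    The default of 4 keeps the words with more than 3 characters.
    """
    return dict((word, count) for word, count in hist.items()
                if len(word) >= min_length)


hist = {'the': 5, 'emma': 3, 'and': 4, 'woodhouse': 1}
filtered = long_words(hist)
```

The original histogram is left untouched, so the unfiltered totals can still
be printed afterwards.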
>
> > #4. Modify the printing function to print top 1 or 2 or 3 ....
>
> I assume this means take a parameter that specifies the
> number of words to print. Or it could be the length of
> word to ignore. Again the specification is woolly.
> In either case it's a small modification to your
> reporting function.
>
> > #5. How many unique words are there in the book of length 1, 2, 3 etc
>
> This is slicing the data slightly differently but
> again not that different to the earlier requirement.
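One reading of #5 is: for each word length, how many distinct words have
that length. Since every key of the histogram is already a unique word, a
sketch of that reading (unique_words_by_length is an invented name) could be:

```python
def unique_words_by_length(hist):
    """Return a dict mapping word length -> number of distinct words."""
    lengths = {}
    for word in hist:
        lengths[len(word)] = lengths.get(len(word), 0) + 1
    return lengths


hist = {'a': 4, 'i': 2, 'the': 5, 'and': 4, 'emma': 3}
by_length = unique_words_by_length(hist)
```

Iterating over sorted(by_length) would print the lengths in increasing
order.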
>
> > I am fairly new to Python and am completely lost. I looked in my book as
> > to how to do number one, but I cannot figure out what to modify and/or
> > delete to add the print selection. This is the code:
>
> You need to modify the two broken functions and add a
> new reporting function. (Despite the reference to a
> printing function I'd suggest keeping the data extraction
> and printing separate.)
>
> > import string
> >
> > def process_file(filename):
> >      hist = dict()
> >      fp = open(filename)
> >      for line in fp:
> >          process_line(line, hist)
> >      return hist
> >
> > def process_line(line, hist):
> >      line = line.replace('-', ' ')
> >      for word in line.split():
> >          word = word.strip(string.punctuation + string.whitespace)
> >          word = word.lower()
> >          hist[word] = hist.get(word, 0) + 1
> >
> > def common_words(hist):
> >      t = []
> >      for key, value in hist.items():
> >          t.append((value, key))
> >      t.sort(reverse=True)
> >      return t
> >
> > def most_common_words(hist, num=100):
> >      t = common_words(hist)
> >      print 'The most common words are:'
> >      for freq, word in t[:num]:
> >          print freq, '\t', word
> > hist = process_file('emma.txt')
> > print 'Total num of Words:', sum(hist.values())
> > print 'Total num of Unique Words:', len(hist)
> > most_common_words(hist, 50)
> >
> > Any help would be greatly appreciated because I am struggling in this
> > class. Thank you in advance.
>
> hth
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.flickr.com/photos/alangauldphotos
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 17 Oct 2013 15:37:49 +0200
> From: Peter Otten <__peter__ at web.de>
> To: tutor at python.org
> Subject: Re: [Tutor] Help please
> Message-ID: <l3op59$8n6$1 at ger.gmane.org>
> Content-Type: text/plain; charset="ISO-8859-1"
>
> Alan Gauld wrote:
>
> [Ruben Pinedo]
>
> > def process_file(filename):
> >     hist = dict()
> >     fp = open(filename)
> >     for line in fp:
> >         process_line(line, hist)
> >     return hist
> >
> > def process_line(line, hist):
> >     line = line.replace('-', ' ')
> >
> >     for word in line.split():
> >         word = word.strip(string.punctuation + string.whitespace)
> >         word = word.lower()
> >
> >         hist[word] = hist.get(word, 0) + 1
>
> [Alan Gauld]
>
> > If so you need to fix the bugs in process_line() and
> > process_file(). I don't know if these are deliberate bugs
> > or somebody is just sloppy. But neither works as expected
> > right now. (Hint: Consider the return values of each)
>
> I fail to see the bug.
>
> process_line() mutates its `hist` argument, so there's no need to return
> something. Or did you mean something else that escapes me?
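To illustrate the point: a dict passed to a function is the same object the
caller holds, so changes made inside the function are visible afterwards
without a return. A toy example, with add_word standing in for
process_line:

```python
def add_word(word, hist):
    """Count word in hist in place; returns None on purpose."""
    hist[word] = hist.get(word, 0) + 1


hist = {}
result = add_word('emma', hist)
add_word('emma', hist)
# hist was changed by both calls even though nothing was returned
```

After the two calls hist holds a count of 2 for 'emma', while result is
None.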
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 17 Oct 2013 16:17:27 +0200
> From: Dominik George <nik at naturalnet.de>
> To: Todd Matsumoto <c.t.matsumoto at gmail.com>,tutor at python.org
> Subject: Re: [Tutor] Help please
> Message-ID: <f310f0be-858d-48e2-ae88-5ad720518888 at email.android.com>
> Content-Type: text/plain; charset=UTF-8
>
> Todd Matsumoto <c.t.matsumoto at gmail.com> wrote:
> >> #1. Error handling for the files to ensure reading only .txt file
> >Look up exceptions.
> >Find out what the string method endswith() does.
>
> One should note that the OP probably meant files of the type text/plain
> rather than .txt files. File name extensions are a convenience to identify
> a file on first glance, but they tell absolutely nothing about the contents.
>
> So, look up MIME types as well ;)!
>
> -nik
>
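The standard mimetypes module guesses the type from the file name, so it
does not go beyond the extension check. A crude content-based test, as a
sketch only: looks_like_text is a made-up helper, and the rules (no NUL
byte, first block decodes as UTF-8) are assumptions of mine that will
misjudge some files, for example text stored in another encoding.

```python
import codecs


def looks_like_text(path, block_size=1024):
    """Guess whether the file at path holds plain text.

    Heuristic only: the first block must contain no NUL byte and must
    decode as UTF-8 (an incomplete character at the end is tolerated).
    """
    with open(path, 'rb') as fp:
        block = fp.read(block_size)
    if b'\0' in block:
        return False
    decoder = codecs.getincrementaldecoder('utf-8')()
    try:
        decoder.decode(block)  # final=False: a cut-off character is fine
    except UnicodeDecodeError:
        return False
    return True
```

An empty file counts as text under these rules, which seems acceptable for a
word counter.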
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 17 Oct 2013 14:21:17 +0000
> From: "Kengesbayev, Askar" <askar.kengesbayev at etrade.com>
> To: "Pinedo, Ruben A" <rapinedo at miners.utep.edu>, "tutor at python.org"
>         <tutor at python.org>
> Subject: Re: [Tutor] Help please
> Message-ID:
>         <6FAD14604B087B438F6FF64D9875A40C68F5ADCA at atl1ex10mbx4.corp.etradegrp.com>
>
> Content-Type: text/plain; charset="us-ascii"
>
> Ruben,
>
> #1. You can try something like this:
>     try:
>         with open('my_file.txt') as file:
>             pass
>     except IOError as e:
>         # Does not exist or you do not have read permission
>         print "Unable to open file"
>
> #2. I would use a regular expression to push the words into an array
> (a list), and then you can manipulate the array. I am not sure it is
> an efficient way, but it should work.
> #3. An easy way would be to use a regular expression (the re module).
> #4. Once you have the array from #2, you can sort it and print whatever
> top words you need.
> #5. I am not sure of the best way for this one, but you can experiment
> with the array from #2.
>
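The regular-expression idea from #2 and #3 might look like this. It is a
sketch: the pattern and the function name words_in are my guesses at what
was meant, and the pattern treats an apostrophe inside a word as part of it.

```python
import re

# One or more letters, optionally followed by apostrophe-joined parts
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def words_in(text, min_length=1):
    """Return the lower-cased words of text with at least min_length characters."""
    return [word for word in WORD_RE.findall(text.lower())
            if len(word) >= min_length]


all_words = words_in("Emma Woodhouse, handsome, clever, and rich -- don't!")
long_only = words_in("Emma Woodhouse, handsome, clever, and rich", min_length=5)
```

The resulting list can be counted, sorted and sliced in the same way as the
words produced by line.split() in the original code.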
> Thanks,
> Askar
>
> From: Pinedo, Ruben A [mailto:rapinedo at miners.utep.edu]
> Sent: Wednesday, October 16, 2013 2:49 PM
> To: tutor at python.org
> Subject: [Tutor] Help please
>
> I was given this code and I need to modify it so that it will:
>
> #1. Error handling for the files to ensure reading only .txt file
> #2. Print a range of top words... ex: print top 10-20 words
> #3. Print only the words with > 3 characters
> #4. Modify the printing function to print top 1 or 2 or 3 ....
> #5. How many unique words are there in the book of length 1, 2, 3 etc
>
> I am fairly new to Python and am completely lost. I looked in my book as
> to how to do number one, but I cannot figure out what to modify and/or
> delete to add the print selection. This is the code:
>
>
> import string
>
> def process_file(filename):
>     hist = dict()
>     fp = open(filename)
>     for line in fp:
>         process_line(line, hist)
>     return hist
>
> def process_line(line, hist):
>     line = line.replace('-', ' ')
>
>     for word in line.split():
>         word = word.strip(string.punctuation + string.whitespace)
>         word = word.lower()
>
>         hist[word] = hist.get(word, 0) + 1
>
> def common_words(hist):
>     t = []
>     for key, value in hist.items():
>         t.append((value, key))
>
>     t.sort(reverse=True)
>     return t
>
> def most_common_words(hist, num=100):
>     t = common_words(hist)
>     print 'The most common words are:'
>     for freq, word in t[:num]:
>         print freq, '\t', word
>
> hist = process_file('emma.txt')
> print 'Total num of Words:', sum(hist.values())
> print 'Total num of Unique Words:', len(hist)
> most_common_words(hist, 50)
>
> Any help would be greatly appreciated because I am struggling in this
> class. Thank you in advance.
>
> Respectfully,
>
> Ruben Pinedo
> Computer Information Systems
> College of Business Administration
> University of Texas at El Paso