[Tutor] Review and criticism of python project

Tiger12506 keridee at jayco.net
Sat Jan 5 02:32:06 CET 2008


Cool! Code! (Gollum voice - we must look at the code, yesss....)

> Hi there.
>
> What this section does is take a comma-separated data file and import it -
> there is a unique ID in the first field and a code in the second field that
> corresponds to a certain section of information. What I need from this is
> for the process to roll up, against each unique ID, all section holdings
> without duplicates, report on section combinations, and give overall
> section counts. In addition, I need the ability to assign a page count
> value to these sections, and the ability to upload a translation file in
> case a section is identified by multiple values that need to be normalized
> to a single unique value.
>
> Sorry for the lengthy code - all comments are appreciated. As mentioned, I
> am quite new to Python. It is doing what I need it to do, but I think it is
> a mess and needs to be cleaned up a little.
>
> Thanks for any comments.
>
> GTXY20
>
> import sys
> import os
> class __analysis:
>    def __init__(self):
>        print '***Analysis Tool***'
>        datafile=raw_input('data file name:')
>        self.datafile=datafile
>        self.parsefile()

Ouch. Usually in OOP, one never puts any user interaction inside a class. 
Any time you want to make an __analysis object, code execution is halted 
until the user responds.... Yuck. If necessary, make the data file a 
parameter you pass in, and do the raw_input outside of the class definition 
before you instantiate an object.
As an aside, assigning raw_input's result to datafile and then datafile to 
self.datafile is a little wasteful when you can assign it directly to 
self.datafile.
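
For example, a minimal sketch (renaming the class to Analysis purely for 
illustration):

class Analysis:
    def __init__(self, datafile):
        self.datafile = datafile    # file name is passed in, no prompting here
        self.parsefile()

datafile = raw_input('data file name:')
a = Analysis(datafile)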

> # script to import unitID section data and section page count reference and create a sorted dictionary
> # where in uhdata{} key=unitID and value=unitID section holdings
> # where in pgcnt{} key=Section and value=page count
>
>    def parsefile(self):
>        try:
>            uhdatafile = open(self.datafile, 'r')
>            records = uhdatafile.read()
>            uhdatafile.close()
>            lines = records.split()

Ummm, this will split on every run of whitespace, not just on newlines. Using 
the name 'lines' as an indicator, I think you were expecting this to split on 
newline characters only; for that you should use uhdatafile.readlines().

>            self.uhdata={}
>            for line in lines:

Here you iterate over a list you built in memory. It is easier and more 
efficient to iterate over the file itself:
for line in uhdatafile:
   do_stuff()
uhdatafile.close()
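
Fleshed out for this file, that might look roughly like this (just a sketch - 
note the strip(), since lines read straight from a file keep their trailing 
newline):

self.uhdata = {}
uhdatafile = open(self.datafile, 'r')
for line in uhdatafile:
    uh, tf = line.strip().split(',')   # strip() drops the trailing newline
    # ... build self.uhdata from uh and tf as below ...
uhdatafile.close()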

>                uh, tf = line.split(',')
>                if uh in self.uhdata:
>                    f=self.uhdata[uh]
>                    if tf not in f:
>                        f.append(tf)
>                else:
>                    self.uhdata[uh]=[tf]

This can read (keeping your "no duplicates" check)

try:
    f = self.uhdata[uh]
except KeyError:
    f = self.uhdata[uh] = []
if tf not in f:
    f.append(tf)
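
Or, if you prefer, dict.setdefault does the lookup-or-create in one step:

f = self.uhdata.setdefault(uh, [])   # existing list, or a new empty one
if tf not in f:
    f.append(tf)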


Also note that there is no error checking anywhere in this algorithm. The 
file must be formatted exactly right, or the line.split(',') will raise an 
error.
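
If you want to guard against that, a minimal sketch of the loop with a little 
checking might look like:

for line in uhdatafile:
    line = line.strip()
    if not line:
        continue                      # skip blank lines
    try:
        uh, tf = line.split(',')
    except ValueError:
        print 'skipping malformed line:', line
        continue
    # ... build self.uhdata as above ...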

>            for uh, Sections in self.uhdata.items():
>                Sections.sort()

A note here: the documentation says dict.items() returns a new list of 
(key, value) pairs, but the value objects themselves are not copied, so 
sorting each Sections list in place like this does work. Still, since uh is 
never used, itervalues() is simpler and avoids building that list:

for Sections in self.uhdata.itervalues():
    Sections.sort()


>        except IOError:
>            print 'file not found check file name'
>            analysis()
>
>        ftranslateok=raw_input('would you like to translate section codes? (y/n):')
>        if ftranslateok == 'y':
>            self.transFn()
>        else:
>            pass

This else: pass is unnecessary; code execution will continue whether you 
write it or not.

>        pgcountok=raw_input('would you like to assign section page counts? (y/n):')
>        if pgcountok == 'y':
>            self.setPageCounts()
>        else:
>            missingpgcounts={}
>            fmissingpgcounts=[]
>            for x in self.uhdata:
>                for f in self.uhdata[x]:
>                    if f not in fmissingpgcounts:
>                        fmissingpgcounts.append(f)

Ahhh... This looks like a set. The values of self.uhdata are lists of 
sections, so flatten them as you build it:

fmissingpgcounts = set()
for Sections in self.uhdata.itervalues():
    fmissingpgcounts.update(Sections)

>            for x in fmissingpgcounts:
>                missingpgcounts[x]=0
>            self.pgcounts = missingpgcounts

Oh my. So we have a dictionary, build a unique collection of its values, and 
then create another dictionary from that? The intermediate variable 
missingpgcounts is not necessary:

self.pgcounts = dict((x,0) for x in fmissingpgcounts)
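
Or dict.fromkeys does the same thing in one call:

self.pgcounts = dict.fromkeys(fmissingpgcounts, 0)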

>        fdistmodel=raw_input('would you like to define max section distribution cut off? (y/n):')
>        if fdistmodel == 'y':
>            self.fdistmax=raw_input('what is the max distributions before a full book?:')
>            self.fdistmax=int(self.fdistmax)
>            self.Sectiondistmax()
>        else:
>            self.fdistmax=1000000000
>            self.Sectiondistmax()

self.Sectiondistmax() needs to be typed only once, unindented, after the 
if/else block.
If self.fdistmax will never be zero, then you can do this...

self.fdistmax = int(self.fdistmax) or 1000000000

Oh, and if the user presses enter without putting in a number, then your 
program will exit with an error. If you want, you can use this to your 
advantage (press enter for the default of all):

try:
    self.fdistmax = int(self.fdistmax)
except ValueError:
    self.fdistmax = False

False is a better flag here. You might some day have a real value of a 
billion, but you can never, ever possibly have a real cut off of False 
(remember, False is just 0, and we said fdistmax will never be zero). It also 
helps readability. See below.

>        sys.exit(1)
>
> # function to determine number of uniqueID for each section
>    def Sectionqty(self):
>        Sectionqtyoutfile = open('Sectionqty.txt', 'w+')
>        Sectionqtyoutfile.write ('Section\tQTY\n')
>        from collections import defaultdict
>        fcounts=defaultdict(int)
>        flst=[]
>        flst2=[]

Okay, this whole section can be rewritten, because the two blocks are 
basically the same thing, with one extra condition controlled by a flag. Here 
is an attempt at a rewrite that avoids the copied code.

>        if self.fdistmax == 1000000000:
>            for v in self.uhdata.values():
>                for item in v:
>                    fcounts[item]+=1
>
>            for k,v in sorted(fcounts.items()):
>                Section=k
>                fqty=v
>                Sectionqtyoutfile.write ('%s\t%s\n' % (Section, fqty))
>
>        else:
>            for k,v in self.uhdata.items():
>                if len(v)<=self.fdistmax:
>                    flst.append(self.uhdata[k])
>            for i in flst:
>                for x in i:
>                    flst2.append(x)
>            for Sections in flst2:
>                    fcounts[Sections]+=1
>            for k,v in sorted(fcounts.items()):
>                Section= k
>                fqty= v
>                Sectionqtyoutfile.write ('%s\t%s\n' % (Section, fqty))

for v in self.uhdata.itervalues():
    if not self.fdistmax or len(v) <= self.fdistmax:   # False flag means no cut off
        for x in v:
            fcounts[x] += 1
for Section, fqty in sorted(fcounts.items()):
    Sectionqtyoutfile.write('%s\t%s\n' % (Section, fqty))

>        Sectionqtyoutfile.close()
>        self.SectionCombqty()
>
> # function to determine number of uniqueID section combinations and associated section page counts
>    def SectionCombqty(self):
>        SectionCombqtyoutfile = open('SectionCombqty.txt', 'w+')
>        SectionCombqtyoutfile.write('Combination Qty\tNumber of Sections\tCombination\tCombinationPageCount\tTotalPages\n')
>        fullbook = 'Full Book'
>        fgreater=[]
>        fcheck=0
>        from collections import defaultdict
>        fcomb=defaultdict(int)
>        for uh in self.uhdata.keys():
>            fcomblst=self.uhdata[uh]
>            fcomb[tuple(fcomblst)]+=1

for value in self.uhdata.itervalues():
  fcomb[tuple(value)]+=1

>        if self.fdistmax == 1000000000:
>            for count, items in sorted( ((v,k) for k,v in fcomb.items()),reverse=True):
>                fpgcounts = sum([self.pgcounts.get(i,i) for i in list(items)])
>                Sectioncomb = ','.join(items)
>                holdings = len(items)
>                totpgcounts = count*fpgcounts
>                SectionCombqtyoutfile.write ('%s\t%s\t%s\t%s\t%s\n' % (count,holdings,Sectioncomb,fpgcounts,totpgcounts))
>
>        else:
>            for count, items in sorted( ((v,k) for k,v in fcomb.items()),reverse=True):
>                if len(items) <= self.fdistmax:
>                    fpgcounts = sum([self.pgcounts.get(i,i) for i in list(items)])
>                    Sectioncomb = ','.join(items)
>                    holdings = len(items)
>                    totpgcounts = count*fpgcounts
>                    SectionCombqtyoutfile.write ('%s\t%s\t%s\t%s\t%s\n' % (count,holdings,Sectioncomb,fpgcounts,totpgcounts))
>            for count, items in sorted( ((v,k) for k,v in fcomb.items()),reverse=True):
>                if len(items)>self.fdistmax:
>                    fgreater.append(count)
>
>            fcheck=sum(fgreater)
>            SectionCombqtyoutfile.write ('%s\t''>''%s\t%s\t%s\t%s\n' % (fcheck,self.fdistmax,fullbook,fullbook,fullbook))

See what I did to the section above for how to make this one block instead 
of two copy-and-paste blocks.
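
For instance, roughly along these lines (a sketch, again assuming 
self.fdistmax is False when no cut off was chosen):

fgreater = []
for count, items in sorted(((v, k) for k, v in fcomb.items()), reverse=True):
    if not self.fdistmax or len(items) <= self.fdistmax:
        fpgcounts = sum(self.pgcounts.get(i, i) for i in items)
        SectionCombqtyoutfile.write('%s\t%s\t%s\t%s\t%s\n' %
            (count, len(items), ','.join(items), fpgcounts, count * fpgcounts))
    else:
        fgreater.append(count)          # combinations over the cut off
if self.fdistmax:
    SectionCombqtyoutfile.write('%s\t>%s\t%s\t%s\t%s\n' %
        (sum(fgreater), self.fdistmax, fullbook, fullbook, fullbook))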

>
>        SectionCombqtyoutfile.close()
>
> # where in pgcnt{} key=Section and value=page count
>    def setPageCounts(self):
>        pagecountfile=raw_input('page count file name:')
>        self.pagecountfile=pagecountfile
>        try:
>            pagecountinfile = open(self.pagecountfile, 'r')
>            records = pagecountinfile.read()
>            pagecountinfile.close()
>            self.pgcounts={}
>            lines = records.split()
>            for line in lines:
>                fpg, cnt = line.split(',')
>                self.pgcounts[fpg]=int(cnt)

This is very similar to the block at the beginning of this post. Make the 
same changes here.

>        except IOError:
>            print 'file not found check file name'
>            analysis()
>
> # function to determine number of uniqueID distributions and associated Sections held
>    def Sectiondistmax(self):
>        from collections import defaultdict
>        Sectiondistoutfile = open('Sectiondist.txt', 'w+')
>        Sectiondistoutfile.write ('SectionDistributions\tQTY\n')
>        fgreater=[]
>        fullbook = "Full Book"
>        fcheck=0
>        fcount=defaultdict(int)
>        for uh in self.uhdata.keys():
>            f=self.uhdata[uh]
>            fcount[len(f)]+=1
>        if self.fdistmax == 1000000000:
>            for k,v in sorted(fcount.items()):

You don't have to name these k,v. Any names are fine, for example:
for fdist, fqty in sorted(fcount.items()):

>                fdist=k
>                fqty=v
>                Sectiondistoutfile.write ('%s\t%s\n' % (fdist,fqty))
>
>        else:
>            for k,v in sorted(fcount.items()):
>                if k <= self.fdistmax:
>                    fdist=k
>                    fqty=v

Same name thing.

>                    Sectiondistoutfile.write ('%s\t%s\n' % (fdist,fqty))
>
>            for k,v in sorted(fcount.items()):
>                if k > self.fdistmax:
>                    fgreater.append(fcount[k])
>            fcheck=sum(fgreater)
>            Sectiondistoutfile.write ('%s\t%s\n' % (fullbook,fcheck))
>        Sectiondistoutfile.close()
>        self.Sectionqty()
>
> # function to translate UnitID Sectioncodes to normalized assigned Section code (e.g. parent and multiple child section codes)
>    def transFn(self):
>        transfile=raw_input('Section translate file name:')
>        self.transfile=transfile
>        try:
>            transfilein=open(self.transfile, 'r')
>            records = transfilein.read()
>            transfilein.close()
>            lines = records.split()
>            transDict = {}
>            for line in lines:
>                key, value = line.split(',')
>                transDict[key] = value

Again, the same thing here as way up above.
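
For a well-formed translate file you can even build the whole dictionary in 
one go, iterating over the file directly (a sketch):

transfilein = open(self.transfile, 'r')
transDict = dict(line.strip().split(',') for line in transfilein)
transfilein.close()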

>            for key, value in self.uhdata.items():
>                self.uhdata[key] = [ transDict.get(i, i) for i in value ]
>            for k in self.uhdata:
>                self.uhdata[k]=sorted(set(self.uhdata[k]))

Uhhh... Right.

>        except IOError:
>            print 'file not found check file name'
>            analysis()

Just some suggestions. It's a long piece of code for what you are trying to 
accomplish... It could probably be shortened, given how much duplicated code 
is in it. All I did was work through it block by block, changing pieces so 
that they make a little more sense and are more efficient. I suspect a rework 
of the complete design would be beneficial. I'm just too tired and busy to 
think for that long. ;-)

HTH,
JS 


