Need Help with Programming Science Project
Denis McMahon
denismfmcmahon at gmail.com
Sat Jan 25 06:31:08 EST 2014
On Fri, 24 Jan 2014 20:58:50 -0800, theguy wrote:
> I know. I'm kind of ashamed of the code, but it does the job I need it
> to up to a certain point
OK, well first of all take a step back and look at the problem.
You have n exemplars, each from a known author.
You analyse each exemplar, and determine some statistics for it.
You then take your unknown sample and determine the same statistics for
it.
Finally, you compare each exemplar's stats with the sample's stats to try
and find a best match.
So, perhaps you want a dictionary of { author: statistics }, and a
function to analyse a piece of text, which might call other functions to
get e.g. average words per sentence, average letters per sentence, average
word length, the standard deviation of each, the short word ratio (words
<= 3 chars vs words >= 4 chars), and some other statistics.
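As a rough idea of what such an analysis function could look like in
Python 3, here is a minimal sketch. The names split_sentences, split_words
and analyse_text are made up for illustration, the sentence splitting is
deliberately naive, and it assumes the text contains at least one word.

```python
import re
from statistics import mean, pstdev


def split_sentences(text):
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(text):
    # Treat runs of letters and apostrophes as words.
    return re.findall(r"[A-Za-z']+", text)


def analyse_text(text):
    # Returns a tuple of statistics; raises StatisticsError on empty text.
    sentences = split_sentences(text)
    words = split_words(text)
    words_per_sentence = [len(split_words(s)) for s in sentences]
    word_lengths = [len(w) for w in words]
    short = sum(1 for w in words if len(w) <= 3)
    longer = len(words) - short
    return (
        mean(words_per_sentence),    # avg words / sentence
        pstdev(words_per_sentence),  # sd of words / sentence
        mean(word_lengths),          # avg word length
        pstdev(word_lengths),        # sd of word length
        short / longer if longer else float("inf"),  # short word ratio
    )
```

Every text has to go through the same function so that the tuples line up
position by position when you compare them later.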
Given the statistics for each exemplar, you might store these in your
dictionary as a tuple.
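If you go the tuple route in real Python, a namedtuple is one option that
keeps the tuple behaviour but lets you refer to each stat by name. This is
only a sketch: the field names, author names and numbers are placeholders
I made up, not real measurements.

```python
from collections import namedtuple

# Hypothetical field names; use whatever statistics you actually compute.
Stats = namedtuple(
    "Stats", ["avg_words_per_sentence", "avg_word_length", "short_word_ratio"]
)

# Placeholder authors and values, purely for illustration.
exemplar_data = {
    "Author A": Stats(18.2, 4.4, 0.9),
    "Author B": Stats(24.7, 4.9, 0.7),
}

# A namedtuple can still be indexed and iterated like a plain tuple.
first_stat = exemplar_data["Author A"][0]
same_stat = exemplar_data["Author A"].avg_words_per_sentence
```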
This isn't Python, it's a description of an algorithm; it just looks a
bit Pythonic:
# tuple of weightings applied to different stats
stat_weightings = ( 1.0, 1.3, 0.85, ...... )

def get_some_stat( t ):
    # calculate some numerical statistic on a block of text
    # return it

def analyse( f ):
    text = read_file( f )
    return ( get_some_stat( text ), ...... )

exemplar_data = {}

# assumes exemplar_files maps each author to their exemplar file
for author in keys( exemplar_files ):
    exemplar_data[ author ] = analyse( exemplar_files[ author ] )

sample_data = analyse( sample_file )

scores = {}
x = 0

# score for a piece of work is sum of ( abs diff of stat * weighting )
# for all the stats, lower score = closer match
for author in keys( exemplar_data ):
    tmp = 0    # start each author's score from zero
    for i in range( len( exemplar_data[ author ] ) ):
        tmp = tmp + abs( exemplar_data[ author ][ i ] -
                         sample_data[ i ] ) * stat_weightings[ i ]
    scores[ author ] = tmp
    # remember the highest score as a starting point for the search below
    if tmp > x:
        x = tmp

names = []

# find the lowest score, keeping every author who ties for it
for author in keys( scores ):
    if scores[ author ] < x:
        x = scores[ author ]
        names = [ author ]
    elif scores[ author ] == x:
        names.append( author )

print "the best matching author(s) is/are: ", names
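For what it's worth, the scoring step can be written fairly compactly in
real Python 3. The following is a sketch under some assumptions: score and
best_matches are names I've made up, every stats tuple has the same length
as the weightings, and each stat contributes its absolute difference
(taking a square root of a raw difference would fail whenever the
difference is negative).

```python
def score(exemplar_stats, sample_stats, weightings):
    # Sum of weighted absolute differences; lower means a closer match.
    return sum(
        abs(e - s) * w
        for e, s, w in zip(exemplar_stats, sample_stats, weightings)
    )


def best_matches(exemplar_data, sample_stats, weightings):
    # Score every author, then return all authors tied for the lowest score.
    scores = {
        author: score(stats, sample_stats, weightings)
        for author, stats in exemplar_data.items()
    }
    best = min(scores.values())
    names = sorted(author for author, s in scores.items() if s == best)
    return names, scores
```

Using min() on the scores avoids having to seed the search with the
highest score first, and the list of names still handles ties.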
Then all you have to do is find enough ways to calculate stats, and the
magic coefficients to use in stat_weightings.
--
Denis McMahon, denismfmcmahon at gmail.com