[Tutor] Is the difference in outputs with different size input lists due to limits on memory with PYTHON?
Dave Angel
davea at ieee.org
Fri May 7 14:31:08 CEST 2010
Art Kendall wrote:
> On 5/6/2010 8:52 PM, Dave Angel wrote:
>>
>> I got my own copy of the papers, at
>> http://thomas.loc.gov/home/histdox/fedpaper.txt
>>
>> I copied your code, and added logic to it to initialize termlist from
>> the actual file. And it does complete the output file at 83 lines,
>> approx 17000 columns per line (because most counts are one digit).
>> It takes quite a while, and perhaps you weren't waiting for it to
>> complete. I'd suggest either adding a print to the loop, showing the
>> count, and/or adding a line that prints "done" after the loop
>> terminates normally.
>>
>> I watched memory usage, and as expected, it didn't get very high.
>> There are things you need to redesign, however. One is that all the
>> punctuation and digits and such need to be converted to spaces.
>>
>>
>> DaveA
>>
>>
>
> Thank you for going the extra mile.
>
> I obtained my copy before I retired in 2001, and there are some
> differences. In the current copy from the LOC, papers 7, 63, and 81
> start with "FEDERALIST." (an extra period). That explains why you
> have 83. There are also some comments, such as the attributed author.
> After the weekend, I'll do a file compare and look at the differences
> in more detail.
>
> Please email me your version of the code. I'll try it as is. Then
> I'll put in a counter, have it print the count and paper number, and a
> 'done' message.
>
> As a check after reading in the counts, I'll bring in the counts from
> NoteTab and see whether the Python counts sum to the NoteTab totals.
>
> I'll use SPSS to create a version of the .txt file with punctuation
> and numerals changed to spaces and try using that as the corpus.
> Then I'll try to create a similar file with Python.
>
> Art
>
As long as you realize this is very rough. I just wanted to prove there
wasn't anything fundamentally wrong with your approach. But there's
still lots to do, especially with regard to cleaning up the text before
and between the papers. Anyway, here it is.
#!/usr/bin/env python
# word counts: Federalist papers
import os
import re, textwrap

sourcedir = "data/"
outputdir = "results/"
# Create the output directory if it doesn't exist
if not os.path.exists(outputdir):
    os.makedirs(outputdir)
# read the combined file and split into individual papers
# later create a new version that deals with all files in a folder
# rather than having papers concatenated
alltext = file(sourcedir + "feder16.txt").readlines()
filtered = " ".join(alltext).lower()
for ch in ('" ' + ". , ' * - ( ) = @ [ ] ; . ` 1 2 3 4 5 6 7 8 9 0 > : / ?").split():
    filtered = filtered.replace(ch, " ")
# todo: make a better filter, such as keeping only letters, rather than
# replacing specific characters
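# One possible cleaner filter (a sketch, untested here): since the text
# is already lowercased, strip everything that isn't a letter or
# whitespace in a single pass instead of replacing characters one by one:
#     filtered = re.sub(r"[^a-z\s]", " ", filtered)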
words = filtered.split()
print "raw word count is", len(words)
wordset = set(words)
print "wordset reduces it from/to", len(words), len(wordset)
# eliminate words shorter than 4 characters
words = sorted([word for word in wordset if len(word) > 3])
del wordset  # free the memory used by wordset
print "Eliminating words under 4 characters reduces it to", len(words)
# print the first 50
for word in words[:50]:
    print word
print "alltext is size", len(alltext)
papers = re.split(r'FEDERALIST No\.', " ".join(alltext))
print "Number of detected papers is", len(papers)
# print first 50 characters of each, so we can see why some of them are
# missed by our regex above
for index, paper in enumerate(papers):
    print index, "***", paper[:50]
countsfile = file(outputdir + "TermCounts.txt", "w")
syntaxfile = file(outputdir + "TermCounts.sps", "w")
# later create a python program that extracts all words instead of using
# NoteTab
# termfile = open("allWords.txt")
# termlist = termfile.readlines()
# termlist = [item.rstrip("\n") for item in termlist]
# print "termlist is", len(termlist)
termlist = words
# check for SPSS reserved words
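# (a term that collides with an SPSS keyword gets an "_r" suffix; the
# "cond and a or b" construction below is the old Python 2 spelling of
# "a if cond else b")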
varnames = textwrap.wrap(" ".join(
    [v.lower() in ['and', 'or', 'not', 'eq', 'ge', 'gt', 'le', 'lt',
                   'ne', 'all', 'by', 'to', 'with'] and (v + "_r") or v
     for v in termlist]))
syntaxfile.write("data list file="
                 "'c:/users/Art/desktop/fed/termcounts.txt' free/docnumber\n")
syntaxfile.writelines([v + "\n" for v in varnames])
syntaxfile.write(".\n")
# before using the syntax, manually replace spaces inside string values
# with underscores -- replace(ltrim(rtrim(varname)), " ", "_") -- and
# replace any special characters in variable names with @
for p, paper in enumerate(papers):
    counts = []
    for t in termlist:
        counts.append(len(re.findall(r"\b" + t + r"\b", paper,
                                     re.IGNORECASE)))
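    # the terms are plain lowercase words here, so interpolating them into
    # the pattern is safe; re.escape(t) would be the safer choice if terms
    # could ever contain regex metacharacters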
    print p, counts[:5]
    if sum(counts) > 0:
        papernum = re.search("[0-9]+", paper).group(0)
        countsfile.write(str(papernum) + " " +
                         " ".join([str(s) for s in counts]) + "\n")
countsfile.close()
syntaxfile.close()
print "done"  # confirm the loop terminated normally
DaveA