[Tutor] Is the difference in outputs with different size input lists due to limits on memory with PYTHON?

Fri May 7 02:52:26 CEST 2010

Art Kendall wrote:
>
>
> On 5/6/2010 1:51 PM, Dave Angel wrote:
>> Art Kendall wrote:
>>>
>>>
>>> On 5/6/2010 11:14 AM, Dave Angel wrote:
>>>> Art Kendall wrote:
>>>>> I am running Windows 7 64-bit Home Premium with quad CPUs and 8 GB
>>>>> of memory.  I am using Python 2.6.2.
>>>>>
>>>>> I have all the Federalist Papers concatenated into one .txt file.
>>>> Which is how big?  Currently you (unnecessarily) load the entire 
>>>> thing into memory with readlines().  And then you do confusing work 
>>>> to split it apart again, into one list element per paper.   And for 
>>>> a while there, you have three copies of the entire text.  You're 
>>>> keeping two copies, in the form of alltext and papers.
>>>> You print out the len(papers).  What do you see there?  Is it 
>>>> correctly 87 ?  If it's not, you have to fix the problem here, 
>>>> before even going on.
>>>>
>>>>>   I want to prepare a file with a row for each paper and a column 
>>>>> for each term. The cells would contain the count of a term in that 
>>>>> paper.  In the original application in the 1950's 30 single word 
>>>>> terms were used. I can now use NoteTab to get a list of all the 
>>>>> 8708 separate words in allWords.txt. I can then use that data in 
>>>>> statistical exploration of the set of texts.
>>>>>
>>>>> I have the python program(?) syntax(?) script(?) below that I am 
>>>>> using to learn PYTHON. The comments starting with "later" are 
>>>>> things I will try to do to make this more useful. I am getting one 
>>>>> step at a time to work.
>>>>>
>>>>> It works when the number of terms in the term list is small e.g., 
>>>>> 10.  I get a file with the correct number of rows (87) and count 
>>>>> columns (10) in termcounts.txt. The termcounts.txt file is not 
>>>>> correct when I have a larger number of terms, e.g., 100. I get a 
>>>>> file with only 40 rows and the correct number of columns.  With 
>>>>> 8700 terms I also get only 40 rows.  I need to be able to use about
>>>>> 8700 terms. (If this were FORTRAN I would say that the subscript 
>>>>> indices were getting scrambled.)  (As I develop this I would like 
>>>>> to be open-ended with the numbers of input papers and open ended 
>>>>> with the number of words/terms.)
>>>>>
>>>>>
>>>>>
>>>>> # word counts: Federalist papers
>>>>>
>>>>> import re, textwrap
>>>>> # read the combined file and split into individual papers
>>>>> # later create a new version that deals with all files in a folder
>>>>> # rather than having papers concatenated
>>>>> alltext = file("C:/Users/Art/Desktop/fed/feder16v3.txt").readlines()
>>>>> papers= re.split(r'FEDERALIST No\.'," ".join(alltext))
>>>>> print len(papers)
>>>>>
>>>>> countsfile = file("C:/Users/Art/desktop/fed/TermCounts.txt", "w")
>>>>> syntaxfile = file("C:/Users/Art/desktop/fed/TermCounts.sps", "w")
>>>>> # later create a python program that extracts all words instead of
>>>>> # using NoteTab
>>>>> termfile   = open("C:/Users/Art/Desktop/fed/allWords.txt")
>>>>> termlist = termfile.readlines()
>>>>> termlist = [item.rstrip("\n") for item in termlist]
>>>>> print len(termlist)
>>>>> # check for SPSS reserved words
>>>>> varnames = textwrap.wrap(" ".join([v.lower() in ['and', 'or', 
>>>>> 'not', 'eq', 'ge',
>>>>> 'gt', 'le', 'lt', 'ne', 'all', 'by', 'to','with'] and (v+"_r") or 
>>>>> v for v in termlist]))
>>>>> syntaxfile.write("data list file= "
>>>>>     "'c:/users/Art/desktop/fed/termcounts.txt' free/docnumber\n")
>>>>> syntaxfile.writelines([v + "\n" for v in varnames])
>>>>> syntaxfile.write(".\n")
>>>>> # before using the syntax manually replace spaces internal to a
>>>>> # string to underscore // replace (ltrim(rtrim(varname))," ","_")
>>>>> # replace any special characters with @ in variable names
>>>>>
>>>>>
>>>>> for p in range(len(papers)):
>>>> range(len()) is un-pythonic.  Simply do
>>>>         for paper in papers:
>>>>
>>>> and of course use paper below instead of papers[p]
>>>>>    counts = []
>>>>>    for t in termlist:
>>>>>       counts.append(len(re.findall(r"\b" + t + r"\b", papers[p], 
>>>>> re.IGNORECASE)))
>>>>>    if sum(counts) > 0:
>>>>>       papernum = re.search("[0-9]+", papers[p]).group(0)
>>>>>       countsfile.write(str(papernum) + " " + " ".join([str(s) for 
>>>>> s in counts]) + "\n")
>>>>>
>>>>>
>>>>> Art
>>>>>
>>>> If you're memory limited, you really should sequence through the 
>>>> files, only loading one at a time, rather than all at once.  It's 
>>>> no harder.  Use os.listdir() (or glob.glob()) to make a list of
>>>> files, then your loop becomes something like:
>>>>
>>>> for infile in filelist:
>>>>     paper = " ".join(open(infile, "r").readlines())
>>>>
>>>> Naturally, to do it right, you should use    with...  Or at least 
>>>> close each file when done.
>>>>
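
A minimal sketch of that one-file-at-a-time loop, assuming the papers have
already been split into one .txt file per paper inside a single folder.  The
function name iter_papers and the folder layout are placeholders, not part of
the original script; os.listdir() is the standard-library call for listing a
directory.

```python
import os

def iter_papers(folder):
    """Yield (filename, text) for each .txt file in folder, one at a time."""
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".txt"):
            continue  # ignore anything that is not a paper
        path = os.path.join(folder, name)
        # "with" closes the file even if reading raises an exception
        with open(path, "r") as infile:
            yield name, infile.read()
```

Because this is a generator, only one paper's text is held in memory at a
time, and the file name travels with the text so it can be written to the
output row.
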
>>>> DaveA
>>>>
>>>>
>>>
>>> Thank you for getting back to me. I am trying to generalize a 
>>> process that 50 years ago used 30 terms on the whole file and I am 
>>> using the task of generalizing the process to learn python.   In the 
>>> post I sent there were comments to myself about things that I would 
>>> want to learn about.  One of the first is to learn about processing 
>>> all files in a folder, so your reply will be very helpful.  It seems 
>>> that os.listdir() should allow me to include the filespec in the output 
>>> file which would be very helpful.
>>>
>>> To rephrase my questions:
>>> Is there a way to tell python to use more RAM?
>>>
>>> Does python use the same array space over as it counts the 
>>> occurrences for each input document? Or does it keep every row of 
>>> the output someplace even after it has written it to the output? If 
>>> it does keep old arrays, is there a way to "close" the output array 
>>> in RAM between documents?
>>>
>>> I narrowed down the problem.  With 4035 terms it runs OK.  With 4040 
>>> the end of the output matrix is messed up.  I do not think it is a 
>>> limit of my resources that gets in the way.  I have 352G of free 
>>> hard disk if it goes virtual.   I have 8G of RAM.  Even if python 
>>> turns out to be strictly 32Bit I think it would be able to use 3G of 
>>> RAM.  The input file is 1.1M so that should be able to fit in RAM 
>>> many times.
>>>
>>> P.S. I hope I remembered correctly that this list put replies at the 
>>> bottom.
>>> Art
>>>
>> Python comes in 32 and 64 bit versions, so it depends on which you're 
>> running.  A 32-bit executable under Windows is restricted to 2 GB of
>> address space, regardless of physical RAM or disk capacity.  There is
>> a way to configure that in boot.ini to use 3 GB instead, but it doesn't
>> work in all circumstances.  Perhaps in 64-bit Windows, it lets you use
>> 3 GB.
>>
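
A quick way to check which build is running, as a small sketch (the function
name pointer_bits is just a placeholder): the size of a C pointer reported by
the struct module is 4 bytes on a 32-bit interpreter and 8 bytes on a 64-bit
one.

```python
import struct

def pointer_bits():
    """Return 32 or 64 depending on how the running interpreter was built."""
    # "P" is the struct format code for a C pointer (void *)
    return struct.calcsize("P") * 8

print("%d-bit Python" % pointer_bits())
```
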
>> I'm not so sure your problem has anything to do with memory, 
>> however.  If your total input is under 2 MB, then it's almost 
>> certainly not.  But you could get some ideas by examining 
>> len(papers), as I said, and also len(alltext).
>>
>> You ask how to free memory.  I'll assume you're using CPython, 
>> perhaps version 2.6.  If you set a variable to None, it'll free the 
>> previous object it pointed at.  So when you're done with alltext, you 
>> can simply set it to None.  Or use the "del" statement, which also 
>> frees the name itself.  That's already the case with your loop, with 
>> the counts variable.  Each time through the loop, it gets reassigned 
>> to [], freeing the previous counts entirely.
>>
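
To see that behaviour directly, here is a small sketch that assumes CPython,
where an object is released as soon as its last reference goes away.  Blob is
a made-up stand-in for a large object such as alltext (a plain list cannot be
weak-referenced, so a tiny class is used instead).

```python
import weakref

class Blob(object):
    """Hypothetical stand-in for a big object such as alltext."""

def freed_by_rebinding():
    big = Blob()
    probe = weakref.ref(big)   # watches the object without keeping it alive
    big = None                 # drop the only real reference
    return probe() is None     # True once the object has been released

def freed_by_del():
    big = Blob()
    probe = weakref.ref(big)
    del big                    # removes the name as well as the reference
    return probe() is None

print("rebinding: %s, del: %s" % (freed_by_rebinding(), freed_by_del()))
```

Other Python implementations may release the object later, so the result is
only guaranteed for CPython's reference counting.
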
>> If your problem were indeed memory, you could process one file at a 
>> time, and cut it by some 80-fold.   And if that's not enough, you 
>> could process each file one line at a time.
>>
>> You should be able to find your real bug with a judicious use of 
>> prints.  Are you actually looping through that final for loop 87 
>> times?  Or maybe some files don't begin with the word FEDERALIST?
>> Or some files don't have any matches?  (To check that, add an else
>> clause to your "if sum(counts) > 0" test.)
>>
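
Roughly, that check could look like the sketch below.  It keeps the same
regular expression as the script above, but collects the rows in a list and
reports every paper that the "if" would silently drop.  The function name
count_terms and its return values are placeholders, not part of the original
script.

```python
import re

def count_terms(papers, termlist):
    """Return (rows, skipped): counts per paper, and indexes with no matches."""
    rows, skipped = [], []
    for index, paper in enumerate(papers):
        counts = [len(re.findall(r"\b" + t + r"\b", paper, re.IGNORECASE))
                  for t in termlist]
        if sum(counts) > 0:
            rows.append((index, counts))
        else:
            # the branch the original loop has no code for
            print("paper %d: no matches, starts with %r" % (index, paper[:30]))
            skipped.append(index)
    return rows, skipped
```

If len(rows) + len(skipped) is not equal to len(papers), the loop itself is
ending early; if skipped holds more than the empty chunk before the first
"FEDERALIST No.", the split or the term list needs a closer look.
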
>> DaveA
>>
>>
>>
>
> Dave,
> Thank you.
> 87 is what  print len(papers) puts on the screen at the beginning of 
> the run. There are 86 papers in the file.
>
> I checked and each paper starts with "FEDERALIST No."
>
> When I use the 30 original terms, or the 70 used later by others, the 
> output data has the correct document numbers, 1-69, 2 versions of 70, 
> and 71 to 85 in the 86 rows of the output. (which is what I see when I 
> read the text into a word processor). Also the number of output lines 
> that do not start with the correct document number increases as the 
> number of terms increases past 4035.  4045 and 4045 have 84 lines 
> start correctly. 8000 terms has only the first document number read 
> correctly.
>
>  I make no changes to the python code when I run with a longer list of 
> terms.  I make no changes to the original txt file I received.  All I 
> change is the number of terms in allWords.txt. All of the longer lists 
> of terms include the terms on the shorter list so counts should not be 
> sparser with a longer list of terms to count.  All papers should have 
> some counts.
>
> I checked and the python screen says
> Python 2.6.2 (r262:71605, Apr 14 2009, 22:46:50) [MSC v.1500 64 bit 
> (AMD64)] on win32
> so RAM cannot be the problem.
>
> I'll cut the big file down into 1 paper per file, put the paper number 
> into the name of the file, and try that. I only need the papers 
> concatenated to get the list of all words that occur in any file.  
> Right now I use NoteTab to cut and paste that list anyways so I don't 
> need to have 1 big file for python.  (As I learn python a later task 
> would be to generate that list via python.)
>
> BTW is Python some kind of a grandchild to Algol which was around in 
> the early 70's?  It seems reminiscent.
>
>
> Art
>
I got my own copy of the papers, at 
http://thomas.loc.gov/home/histdox/fedpaper.txt

I copied your code, and added logic to it to initialize termlist from 
the actual file.  And it does complete the output file at 83 lines,
approx 17000 characters per line (because most counts are one digit
plus a separating space).  It takes quite a while, and perhaps you
weren't waiting for it to complete.  I'd suggest adding a print to the
loop, showing the count, and/or adding a line that prints "done" after
the loop terminates normally.
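
A sketch of what such progress reporting might look like, assuming the counts
have already been gathered as (papernum, counts) pairs.  The function name
write_counts and the log parameter are illustrative only; the row format is
the one the script already writes.

```python
import sys

def write_counts(rows, outfile, log=sys.stderr):
    """Write one line per paper and report progress; return rows written."""
    written = 0
    for papernum, counts in rows:
        outfile.write(str(papernum) + " " +
                      " ".join(str(c) for c in counts) + "\n")
        written += 1
        log.write("wrote paper %s (%d so far)\n" % (papernum, written))
    # only reached when the loop finishes without an exception
    log.write("done: %d rows\n" % written)
    return written
```

Closing the output file afterwards (or opening it in a with block) makes sure
any buffered rows actually reach the disk before the file is inspected.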

I watched memory usage, and as expected, it didn't get very high.  There 
are things you need to redesign, however.  One is that all the 
punctuation and digits and such need to be converted to spaces.
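
One possible way to do that conversion, as a rough sketch rather than a
finished design.  It assumes plain ASCII text and treats every run of
non-letters (punctuation, digits, line breaks) as a single space, so
"people's" becomes the two words "people" and "s".  The helper names
normalize and count_words are made up here, and counting with a dictionary is
a swapped-in technique, not what the posted script does.

```python
import re

def normalize(text):
    """Lower-case text and replace each run of non-letters with one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()

def count_words(text):
    """Return a dict mapping each word in text to its number of occurrences."""
    counts = {}
    for word in normalize(text).split():
        counts[word] = counts.get(word, 0) + 1
    return counts
```

With the text normalized once per paper, a count for any term is a single
dictionary lookup, e.g. count_words(paper).get("upon", 0), instead of one
regular-expression search per term per paper.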

DaveA