Finding size of Variable

Ayushi Dalmia ayushidalmia2604 at gmail.com
Wed Feb 5 06:15:25 CET 2014


On Wednesday, February 5, 2014 12:51:31 AM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2604 at gmail.com> Wrote in message:
> > Where am I going wrong? What are the alternatives I can try?
>
> You've rejected all the alternatives so far without showing your
> code, or even properly specifying your problem.
>
> To get the "total" size of a list of strings, try (untested):
>
> a = sys.getsizeof(mylist)
> for item in mylist:
>     a += sys.getsizeof(item)
>
> This can be high if some of the strings are interned and get
> counted twice. But you're not likely to get closer without some
> knowledge of the data objects and where they come from.
>
> -- 
> DaveA
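
A minimal sketch of the suggestion above, adjusted so that a string object appearing in several list slots is counted only once. The name total_size is hypothetical, and sys.getsizeof reports only each object's own footprint as CPython sees it, so the result is still an approximation:

```python
import sys

def total_size(strings):
    """Approximate memory used by a list of strings, counting each object once."""
    seen = set()                      # ids of string objects already counted
    total = sys.getsizeof(strings)    # the list object itself (header + pointers)
    for item in strings:
        if id(item) not in seen:      # skip objects shared between slots
            seen.add(id(item))
            total += sys.getsizeof(item)
    return total
```

With this version, appending a second reference to the same string grows the total only by the extra list slot, not by the size of the string again.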

Hello Dave, 

I thought that saving others' time was better, and hence I explained only a subset of my problem. Here is what I am trying to do:

I am trying to index the current Wikipedia dump without using databases and create a search engine for Wikipedia documents. Note, I CANNOT USE DATABASES.

My approach:

I am parsing the Wikipedia pages using a SAX parser. After reading 'X' number of pages, I dump the words along with their posting lists (a posting list is the list of doc ids in which the word is present) into a new file. These files may contain the same word, so I need to merge them and write the final index again. The final index files must each be of limited size. This is where I am stuck: I need to know how to determine the size of the content in a variable before I write it into the file.
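
One possible way to know the size before writing is to measure the text that would be written rather than the Python objects holding it. The sketch below assumes each index entry is stored as one line of the form "word id1 id2 ..." (which matches how the merge code splits lines on spaces) and that doc ids are kept as strings. The name entry_size is hypothetical, and the figure is the uncompressed size; the bz2 output would normally be smaller:

```python
def entry_size(word, postings, encoding="utf-8"):
    """Bytes needed for one 'word id1 id2 ...' line, including the newline."""
    line = " ".join([word] + list(postings)) + "\n"
    return len(line.encode(encoding))
```

For example, entry_size("cat", ["1", "22"]) measures the line "cat 1 22\n". Keeping a running sum of these values gives the size of the pending output without relying on sys.getsizeof.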

Here is the code for my merging:

import bz2
import heapq
import os
from collections import defaultdict

def mergeFiles(pathOfFolder, countFile):
    listOfWords={}
    indexFile={}
    topOfFile={}
    flag=[0]*countFile
    data=defaultdict(list)
    heap=[]
    countFinalFile=0
    for i in xrange(countFile):
        fileName = os.path.join(pathOfFolder, 'index'+str(i)+'.txt.bz2')
        indexFile[i]= bz2.BZ2File(fileName, 'rb')
        flag[i]=1
        topOfFile[i]=indexFile[i].readline().strip()
        listOfWords[i] = topOfFile[i].split(' ')
        if listOfWords[i][0] not in heap:
            heapq.heappush(heap, listOfWords[i][0])        
            
    while any(flag):
        temp = heapq.heappop(heap)
        for i in xrange(countFile):
            if flag[i]==1:
                if listOfWords[i][0]==temp:

                    # This is where I am stuck. I cannot wait until a memory
                    # error, as I need to do some postprocessing too.
                    try:
                        data[temp].extend(listOfWords[i][1:])
                    except MemoryError:
                        writeFinalIndex(data, countFinalFile, pathOfFolder)
                        data=defaultdict(list)
                        countFinalFile+=1
                        # retry so the postings that failed to fit are not lost
                        data[temp].extend(listOfWords[i][1:])

                    topOfFile[i]=indexFile[i].readline().strip()
                    if topOfFile[i]=='':
                        flag[i]=0
                        indexFile[i].close()
                        os.remove(os.path.join(pathOfFolder, 'index'+str(i)+'.txt.bz2'))
                    else:
                        listOfWords[i] = topOfFile[i].split(' ')
                        if listOfWords[i][0] not in heap:
                            heapq.heappush(heap, listOfWords[i][0])
    writeFinalIndex(data, countFinalFile, pathOfFolder)

countFile is the number of intermediate files, and the writeFinalIndex function writes the merged index into a file.
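
For comparison, here is a rough sketch of a size-capped merge that flushes on a byte budget instead of waiting for MemoryError. It swaps the hand-managed heap for the standard library's heapq.merge (the key argument needs Python 3.5 or later) and groups equal words with itertools.groupby. The names merge_runs, max_bytes and flush are hypothetical, the inputs are assumed to be iterables of text lines already sorted by word, and the budget refers to uncompressed bytes:

```python
import heapq
from itertools import groupby

def merge_runs(runs, max_bytes, flush):
    """Merge sorted 'word id1 id2 ...' lines; flush chunks of roughly max_bytes.

    runs      -- iterables of text lines, each sorted by word
    max_bytes -- hypothetical cap on the uncompressed size of one output chunk
    flush     -- callback that receives a list of merged lines
    """
    def parsed(lines):
        # Turn each non-empty line into (word, [doc ids]).
        for line in lines:
            parts = line.split()
            if parts:
                yield parts[0], parts[1:]

    # Lazily merge all runs by word; only one entry per run is held at a time.
    merged = heapq.merge(*(parsed(run) for run in runs), key=lambda e: e[0])

    chunk, size = [], 0
    for word, group in groupby(merged, key=lambda e: e[0]):
        postings = [doc for _, ids in group for doc in ids]
        line = " ".join([word] + postings) + "\n"
        n = len(line.encode("utf-8"))
        if chunk and size + n > max_bytes:
            flush(chunk)              # budget would be exceeded: write what we have
            chunk, size = [], 0
        chunk.append(line)
        size += n
    if chunk:
        flush(chunk)                  # write the final partial chunk
```

In this sketch a word's postings are never split across two chunks, so a single entry larger than max_bytes is flushed as a chunk of its own. The flush callback could, for instance, write the lines into the next numbered bz2 file.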
