[python-win32] Need help with AutoSummarize feature in Word

Daniel Greenfeld pydanny at gmail.com
Thu Apr 20 21:36:55 CEST 2006


Hello,

Last year I made the migration from Java to Python and have been
having lots of fun.  Just this month I got tasked with a COM
programming effort, and since none of us on the team are COM
programmers, we decided to do the effort in Python and I was assigned
the task.  And, of course, in our ignorance of COM programming, we ran
into a few snags.

The project I am on requires that we go through 20,000 Word
documents and run AutoSummarize on each one.  I have something
that kind of works, but it has some issues.  Basically, the
code I wrote does the following:

1. Open the Word Document.
2. Do the AutoSummary.
3. Save the results to a flat file for later parsing.
4. Close the Word Document.
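
The four steps above can be sketched as a single function.  This is only
a sketch under assumptions: summarize_document is a hypothetical name,
the caller supplies an already-created Word application object, and the
summary is taken to be the active document whenever AutoSummarize in
create-new mode has added a document (the same assumption the script
further down makes).

```python
WD_SUMMARY_MODE_CREATE_NEW = 3   # wdSummaryModeCreateNew
WD_DO_NOT_SAVE_CHANGES = 0       # wdDoNotSaveChanges


def summarize_document(app, path, length=25):
    """Open `path`, auto-summarize it, and return the summary text.

    `app` is a Word.Application COM object (or anything exposing the
    same attributes).  Every document opened or created here is closed
    again before the function returns, even if summarizing fails.
    """
    source = app.Documents.Open(FileName=path)           # step 1
    summary = None
    try:
        before = app.Documents.Count
        source.AutoSummarize(length, WD_SUMMARY_MODE_CREATE_NEW, True)  # step 2
        # Assumption: if a new document appeared, it is the summary and
        # it is now the active document.
        if app.Documents.Count > before:
            summary = app.ActiveDocument
            return summary.Content.Text                  # step 3: caller saves it
        return ''
    finally:
        # step 4: close what was opened, without saving anything
        if summary is not None:
            summary.Close(SaveChanges=WD_DO_NOT_SAVE_CHANGES)
        source.Close(SaveChanges=WD_DO_NOT_SAVE_CHANGES)
```

Holding on to the two document references, rather than asking for
ActiveDocument each time something has to be closed, is what keeps the
source document from being left open.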

The problem is that this just seems very inefficient.  It sometimes
neglects to close the Word documents, so my computer ends up loaded
with tons of open Word documents (fortunately I have restricted the
number of runs it goes through).  Also, yesterday I actually managed
to break the COM server (or whatever it may be called).
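
One possible way to keep stray documents from piling up is to wrap the
whole run so that anything still open gets closed, and Word itself is
shut down, even when something fails partway through.  A sketch under
assumptions: word_session and make_app are hypothetical names, and
quitting is only appropriate when the script owns the Word instance.
Documents.Close and Application.Quit are Word object model calls; 0 is
wdDoNotSaveChanges.

```python
import contextlib


@contextlib.contextmanager
def word_session(make_app):
    """Yield a Word application object and always shut it down afterwards.

    `make_app` is any zero-argument callable returning the application
    object, e.g. one that calls gencache.EnsureDispatch('Word.Application').
    """
    app = make_app()
    try:
        yield app
    finally:
        try:
            # Close every document that is still open, discarding changes.
            if app.Documents.Count:
                app.Documents.Close(SaveChanges=0)
        finally:
            # Quit even if closing the documents raised.
            app.Quit(SaveChanges=0)
```

It would be used as "with word_session(factory) as app:" around the
loop over files, where factory is whatever creates the COM object.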

I'm sure there is a better way of doing things, but actually finding a
COM programmer who really understands COM is turning out to be harder
than I thought.  My hope is that I can get some pointers or even some
code fixes so I can progress forward on this effort.

If this is not the place to post this sort of request, please direct
me to where I should go.

For reference, here is the code as it stands:

""" This is a test script for playing with various COM word API items
using the win32com Python library from the ActivePython installation.
It works in ActivePython, likely nowhere else.


Notes:

	1. This file needs major cleanup
	2. This broke Word as a COM server.  Need to understand COM better
	3. Perhaps run multiple instances of this file?
"""

import os
import datetime # For performance analysis
import time
from win32com.client import gencache, constants, makepy #basic win32com objects

# COM constants that must be established
wdSummaryModeCreateNew  = 0x3
WORD                    = 'Word.Application'
False, True             = 0, -1

# Other constants
# How many docs to check before ending.  Set to -1 to ensure the
# entire system is slurped.
breakOut    = 10

# Separators used in output
# docSep* delimit one document's section of the output file.
docSepB      = '\n' + '<document>' + '\n'
docSepE      = '\n' + '</document>' + '\n'
# sumSep* delimit one summary percentage inside a document section.
sumSepB      = '\n' + '<summary>' + '\n'
sumSepE      = '\n' + '</summary>' + '\n'

dumpfile = open('test.txt', 'w')

class Word:
    """ I represent all the fun of playing with a MS Word document."""

    def __init__(self):
        """ I initialize the COM object library for word. """

        self.app = gencache.EnsureDispatch(WORD)
        self.summaryPercentages = (5, 10, 18, 25)
        self.errors = 0

    def open(self, doc):
        """ I open the Word file to be autosummarized. """

        self.app.Documents.Open(FileName = doc)

    def autoSummarize(self, Length = 30, Mode = wdSummaryModeCreateNew,
                      UpdateProperties = True):
        """ I do the autosummary and return the content.  This
        actually creates a new tmp word file."""
        try:
            self.app.ActiveDocument.AutoSummarize(Length, Mode,
                                                  UpdateProperties)
            return self.app.ActiveDocument.Content.Text
        except:
            # Count the failure and fall through to an empty summary.
            self.errors += 1

        return ''

    def close(self):
        """ I close the Word document."""

        self.app.ActiveDocument.Close(SaveChanges=False)

if __name__ == '__main__':

    print '*'*80
    word = Word()
    startTime = datetime.datetime.now()
    count = 0



    for root, dirs, files in os.walk('C:/wordData/'):
        for file in files:
            # In case we get a non-Word doc or a Word temp file that
            # somehow got saved.
            if file.lower().endswith('.doc') and not file.startswith('~'):
                fileName = os.path.join(root, file)
            else: continue

            print 'File ' + fileName
            dumpfile.write(docSepB)
            dumpfile.write(fileName + '\n')
            for value in word.summaryPercentages:
                word.open(fileName)
                print value
                dumpfile.write(sumSepB)
                dumpfile.write('Length: ' + str(value) + '\n')
                try:
                    data = str(word.autoSummarize(Length=value))
                except:
                    data = ''
                #print data
                if len(data.strip()):
                    dumpfile.write(data)
                else:
                    dumpfile.write('No Summary')

                dumpfile.write(sumSepE)
                word.close()
                time.sleep(1)


            dumpfile.write(docSepE)    # closing of the doc
            dumpfile.write('*' * 80 + '\n')    # visual separator between documents
            word.close()
            time.sleep(3)
            count += 1

            if count == breakOut:
                break
        if count == breakOut:
            break
    dumpfile.close()
    print 'Done: ' + str(datetime.datetime.now() - startTime)
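
The nested loops with their two break checks could also be flattened
into one iterator.  A sketch, assuming a hypothetical helper named
iter_word_files: it walks the tree, applies the same '.doc' and '~'
filter as the script above, and stops after `limit` files (or never,
when `limit` is None).

```python
import itertools
import os


def iter_word_files(top, limit=None):
    """Return an iterator over .doc paths under `top`.

    Word temp files (names starting with '~') are skipped.  At most
    `limit` paths are produced; None means no limit.
    """
    def candidates():
        for root, dirs, files in os.walk(top):
            for name in sorted(files):
                if name.lower().endswith('.doc') and not name.startswith('~'):
                    yield os.path.join(root, name)

    # islice stops the walk early once `limit` paths have been produced.
    return itertools.islice(candidates(), limit)
```

The main loop then becomes a single "for fileName in
iter_word_files('C:/wordData/', 10):" with no counter to maintain.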

