[python-win32] Need help with AutoSummarize feature in Word

Tim Roberts timr at probo.com
Fri Apr 21 19:00:33 CEST 2006


On Thu, 20 Apr 2006 15:36:55 -0400, "Daniel Greenfeld"
<pydanny at gmail.com> wrote:

>Last year I made the migration from Java to Python and have been
>having lots of fun.  Just this month I got tasked with a COM
>programming effort, and since none of us on the team are COM
>programmers, we decided to do the effort in Python and I was assigned
>the task.  And, of course, in our ignorance of COM programming, we ran
>into a few snags.
>
>The project I am on requires that we go through 20,000 Word
>Documents and perform autosummaries on each document.  I have
>something that kind of works, but it has some issues.  Basically, the
>code I wrote does the following:
>
>1. Open the Word Document.
>2. Do the AutoSummary
>3. Save the results to a flat file for later parsing.
>4. Close the Word Document.
>
>The problem is that this just seems very inefficient.
>

There is simply no efficient way to do this.  What you have is basically
the right approach, with some tweaking.


>It sometimes
>neglects to close the word documents so then my computer gets loaded
>with tons of open word documents
>

Right, because you close the summary, but you never close the original
document.


>import os
>import datetime # For performance analysis
>import time
>from win32com.client import gencache, constants, makepy #basic win32com objects
>
># COM constants that must be established
>wdSummaryModeCreateNew  = 0x3
>  
>

That constant should be in win32com.client.constants after you do your
EnsureDispatch.
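
A minimal sketch of that lookup, assuming Word and pywin32 are installed.
The helper names summary_mode and start_word, and the fallback literal,
are illustrative and not part of win32com; the fallback only covers the
case where the type library constants were not loaded.

```python
# Value the original script hard-coded; kept only as a fallback.
WD_SUMMARY_MODE_CREATE_NEW = 3


def summary_mode(constants):
    """Return wdSummaryModeCreateNew from a constants object.

    `constants` is expected to be win32com.client.constants after
    gencache.EnsureDispatch has loaded Word's type library.
    """
    return getattr(constants, 'wdSummaryModeCreateNew',
                   WD_SUMMARY_MODE_CREATE_NEW)


def start_word():
    """Start Word and return (app, summary mode). Needs Windows and Word."""
    # Imported here so summary_mode() can be used without pywin32.
    from win32com.client import gencache, constants
    app = gencache.EnsureDispatch('Word.Application')
    return app, summary_mode(constants)
```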

>WORD                    = 'Word.Application'
>False, True             = 0, -1
>  
>

That's an incredibly bad idea.  Python has built-in constants called
False and True, equal to 0 and 1, and rebinding those names changes what
they mean for the rest of your module.  You should be able to use the
built-in values with APIs that expect a Boolean.
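
For illustration, plain Python shows why the rebinding is unnecessary:
the built-in Booleans already behave as integers.  That pywin32 converts
them to the COM Boolean type when passing arguments is my understanding,
not something checked here, and close_args is a made-up name.

```python
# Python's built-in Booleans are a subclass of int.
assert isinstance(True, int)
assert True == 1 and False == 0

# COM's VARIANT_TRUE is -1, which is what the original script tried to mimic.
# Both -1 and True are truthy, so Boolean tests treat them alike.
VARIANT_TRUE = -1
assert bool(VARIANT_TRUE) is True

# Hypothetical keyword arguments as they might be passed to a COM call,
# using the built-in False rather than a redefined one.
close_args = {'SaveChanges': False}
```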

># separators used in output
>docSepB      = '\n' + '<document>' + '\n' #Used to break documents inside of the output file
>docSepE      = '\n' + '</document>' + '\n' #Used to break documents inside of the output file
>sumSepB      = '\n' + '<summary>' + '\n' #Used to break summary percentage displays inside of a document section in the output file.
>sumSepE      = '\n' + '</summary>' + '\n' #Used to break summary percentage displays inside of a document section in the output file.
>  
>

Kind of a trivial note -- the + operator on strings is inefficient. 
It's better just to create the constants as one chunk, but since this is
only a one-time thing, it really doesn't matter.
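
Written as one chunk each, the separators from the original script
would look something like this:

```python
# One literal per separator instead of three concatenated pieces.
docSepB = '\n<document>\n'    # opens a document section in the output file
docSepE = '\n</document>\n'   # closes a document section
sumSepB = '\n<summary>\n'     # opens one summary-percentage block
sumSepE = '\n</summary>\n'    # closes one summary-percentage block

# Same strings as the concatenated versions in the original script.
assert docSepB == '\n' + '<document>' + '\n'
assert sumSepE == '\n' + '</summary>' + '\n'
```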

>class Word:
>    """ I represent all the fun of playing with a MS Word document."""
>
>    def __init__(self):
>        """ I initialize the COM object library for word. """
>
>        self.app = gencache.EnsureDispatch(WORD)
>        self.summaryPercentages = (5, 10, 18, 25)
>        self.errors = 0
>
>    def open(self, doc):
>        """ I open the Word file to be autosummarized. """
>
>        self.app.Documents.Open(FileName = doc)
>  
>

The Open API returns a Document object.  You should save that Document
object, so that you can close it later.

        self.original = self.app.Documents.Open( FileName = doc )

>    def autoSummarize(self, Length = 30, Mode = wdSummaryModeCreateNew, UpdateProperties = True):
>        """ I do the autosummary and return the content.  This actually creates a new tmp word file."""
>        try:
>            self.app.ActiveDocument.AutoSummarize(Length, Mode, UpdateProperties)
>  
>

AutoSummarize returns a Range object.  It may be possible to get the
text directly from this Range object, instead of relying on the
ActiveDocument property.
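
An untested sketch of that idea.  The function name summary_text is made
up for illustration, and it assumes the returned Range exposes the
summary through its Text property, which I have not verified for every
summary mode.

```python
def summary_text(doc, length, mode, update_properties=True):
    """Run AutoSummarize on a Word Document object and return the text.

    `doc` is a Document COM object, such as the one returned by
    Documents.Open.  Assumption: the Range that AutoSummarize returns
    holds the summary text in its Text property.
    """
    summary_range = doc.AutoSummarize(length, mode, update_properties)
    return summary_range.Text
```

With wdSummaryModeCreateNew, Word still creates a new summary document,
so that document would need to be closed separately.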

>            return word.app.ActiveDocument.Content.Text
>  
>

You really want "self" instead of "word" here.  You're getting the
global variable "word", which happens to be the same thing in this case,
but better to do it right.

>        except:
>            self.errors += 1
>
>        return ''
>
>    def close(self):
>        """ I close the Word document."""
>
>        self.app.ActiveDocument.Close(SaveChanges=False)
>  
>

Here, you're closing the summary.  You also need to do:
        self.original.Close()


>    for root, dirs, files in os.walk('C:/wordData/'):
>        for file in files:
>            #in case we get a non-word doc or if it is a word temp file that somehow got saved.
>            if file.lower().endswith('.doc') and not file.startswith('~'):
>                fileName = os.path.join(root, file)
>            else: continue
>
>            print 'File ' + fileName
>            dumpfile.write(docSepB)
>            dumpfile.write(fileName + '\n')
>            for value in word.summaryPercentages:
>                word.open(fileName)
>                print value
>  
>

Since you're running 4 different summaries from the same original, why
not do the word.open outside of the loop?  You would have to change the
Word class to close the autosummary document in the autoSummarize call,
instead of in close, but that's easy.
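
A rough sketch of that restructuring.  The class and method names are
illustrative, not the original author's, and it assumes that with
wdSummaryModeCreateNew the summary lands in a new document that becomes
ActiveDocument, as the original script relies on.  The application
object is passed in so the class itself does not import win32com.

```python
class WordSummarizer(object):
    """Open a document once, produce several summaries, then close it."""

    def __init__(self, app):
        # `app` is a Word.Application COM object, for example from
        # win32com.client.gencache.EnsureDispatch('Word.Application').
        self.app = app
        self.original = None

    def open(self, path):
        # Keep the Document that Open returns so it can be closed later.
        self.original = self.app.Documents.Open(FileName=path)

    def summarize(self, length, mode):
        # Summarize the saved original rather than whatever is active.
        self.original.AutoSummarize(length, mode, True)
        # Assumption: the new summary document is now ActiveDocument.
        summary = self.app.ActiveDocument
        try:
            return summary.Content.Text
        finally:
            # Close the temporary summary document right away.
            summary.Close(SaveChanges=False)

    def close(self):
        # Close the original document exactly once.
        if self.original is not None:
            self.original.Close(SaveChanges=False)
            self.original = None


# Usage sketch (requires Windows, Word and pywin32; the path is made up):
#   from win32com.client import gencache, constants
#   word = WordSummarizer(gencache.EnsureDispatch('Word.Application'))
#   word.open(r'C:\wordData\example.doc')
#   try:
#       for pct in (5, 10, 18, 25):
#           text = word.summarize(pct, constants.wdSummaryModeCreateNew)
#   finally:
#       word.close()
```

If AutoSummarize fails without creating a new document, ActiveDocument
may still be the original, so real code would want to check for that
before closing it.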

-- 
Tim Roberts, timr at probo.com
Providenza & Boekelheide, Inc.


