about Python doc reader

norseman norseman at hughes.net
Wed May 13 23:48:47 CEST 2009


norseman wrote:
> Tim Golden wrote:
>> Shailja Gulati wrote:
>>> Hi ,
>>>
>>> I am currently working on "Information retrieval from semi structured 
>>> Documents" in which there is a need to read data from Resumes.
>>>
>>> Could anyone tell me is there any python API to read Word doc?
>>
>> If you haven't already, get hold of the pywin32 extensions:
>>
>>  http://pywin32.sf.net
>>
>> <code>
>> import win32com.client
>>
>> doc = win32com.client.GetObject ("c:/temp/temp.doc")
>> text = doc.Range ().Text
>>
>> </code>
>>
>> Note that this will give you a unicode object with \r line-delimiters.
>> You could read para by para if that were more useful:
>>
>> <code>
>> import win32com.client
>>
>> doc = win32com.client.GetObject ("c:/temp/temp.doc")
>> lines = [p.Range () for p in doc.Paragraphs]
>>
>> </code>
>>
>> TJG
> =======================
> I saw this right after responding to Kushal's 5:37AM today posting.
> 
> Thank you for the tip.  I'll try these first chance I get.
> Word, swriter, whatever - I'm not partial when it comes to automating.
> 
> 
> Today is: 20090513
> 
> Steve
================================
Interesting:

I did try these.

Doc at once:
outputs two x'0D' and then the file contents.  Then it appends x'0D' x'0D' 
x'0A' x'0D' x'0A' to the end of the output even though the source file 
itself has no EOL.
( EOL is EndOfLine  aka newline )

That's  cr cr             There are two blank lines at the beginning.
         cr cr lf cr lf    There is no EOL in source.
                           Any idea what those are about?
One crlf is probably from Python's print text, but the other?
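
One way to tell which of those characters are in the text itself and which
are added by print is to look at the character codes instead of printing
the text. A rough sketch only, not run against Word here: edge_codes is a
made-up helper name, and sample is a stand-in string for whatever
doc.Range ().Text actually returns.

```python
def edge_codes(text, count=5):
    """Return hex codes of the first and last `count` characters of text."""
    head = ["%02X" % ord(ch) for ch in text[:count]]
    tail = ["%02X" % ord(ch) for ch in text[-count:]]
    return head, tail


# Hypothetical sample; with pywin32 this would be text = doc.Range ().Text
sample = u"\r\rJohn Doe\rPython developer\r"

head, tail = edge_codes(sample, 2)
print(head)   # codes of the first two characters
print(tail)   # codes of the last two characters
```

Anything that shows up in the file or on screen but not in these codes was
added on the way out (by print or by the console), not by Word.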

The lines= version:
prepends   [u'\r', u'\r', u"  to the beginning of the output
and appends   \r"]x'0D'x'0A'   to the end even though there is no EOL in source.

That output is understood:    u'\r'  is the old Apple EOL (a lone cr), 
which matches the \r line-delimiters Tim mentioned.
The crlf is probably from print lines.
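
If the paragraph list is what you want to work with, the stray u'\r'
entries and the trailing \r on each paragraph can be dropped in one pass.
A minimal sketch, assuming the list holds plain unicode strings like the
ones shown above; clean_paragraphs is a hypothetical name and the sample
list is made up.

```python
def clean_paragraphs(paragraphs):
    """Strip trailing CR/LF from each paragraph and drop the empty ones."""
    stripped = [p.rstrip(u"\r\n") for p in paragraphs]
    return [p for p in stripped if p.strip()]


# Made-up sample shaped like the output described above
lines = [u"\r", u"\r", u"John Doe\r", u"Python developer\r"]
cleaned = clean_paragraphs(lines)
```

This throws away blank paragraphs entirely, so keep the original list if
the blank lines mean something in your documents.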

Programmers searching for specifics take note. The output is cooked.
I don't have any "weird things" in the test file. (no font changes, no 
subscripts, etc)  Might be best to take a real good look at a test file 
before assuming anything.

But, having an idea of what the extras are makes it somewhat easier to 
allow for.
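
For the whole-document text, allowing for the extras could look something
like the following. This is only a sketch under the assumption that the
extras are nothing but cr and crlf as seen in my test file; normalize_eol
is a made-up name and raw is a stand-in for the real text.

```python
def normalize_eol(text):
    """Turn CR LF pairs and lone CRs into LF, then trim blank lines at both ends."""
    unified = text.replace(u"\r\n", u"\n").replace(u"\r", u"\n")
    return unified.strip(u"\n")


# Made-up sample, not real Word output
raw = u"\r\rJohn Doe\rPython developer\r"
clean = normalize_eol(raw)
```

The crlf pairs are replaced first so that a cr lf does not turn into two
newlines.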


Steve


