[python-win32] UnicodeEncodingError when print a doc file

cool_go_blue cool_go_blue at yahoo.com
Wed Jun 15 16:16:17 CEST 2011


Thanks for your suggestion, Tom. Yes, I think it's an auto-numbered paragraph. How can I programmatically
save the Word file as plain text in Python? By a manual save, do you mean that I
copy and paste into a WordPad file?

--- On Wed, 6/15/11, Tom Hawkins <Tom.Hawkins at innospecinc.com> wrote:

From: Tom Hawkins <Tom.Hawkins at innospecinc.com>
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
To: python-win32 at python.org
Date: Wednesday, June 15, 2011, 7:57 AM

> Thanks. I just found that all item numbers such as 1.1.1 are gone. How can I get these
> numbers? Also, if all items are in a table, how can I get the contents of all items and
> ignore the table structure? Thanks.

   

If you have auto-numbered paragraphs then I’d guess the numbers are generated by Word
as part of formatting the text for display and aren’t part of the literal
text of the document.
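
One way to recover the numbers without leaving the Word object model is to ask each paragraph for its list label. The sketch below assumes that the `ListFormat.ListString` property, reached through `Paragraph.Range.ListFormat`, returns the displayed label such as "1.1.1". The helper name `numbered_paragraphs` is made up for illustration, and nothing here has been tried against the original poster's document.

```python
def numbered_paragraphs(doc):
    """Yield (label, text) pairs for each paragraph of a Word document.

    `doc` is expected to be a Word Document COM object, for example the
    result of win32com.client.Dispatch('Word.Application').Documents.Open(path).
    """
    for para in doc.Paragraphs:
        rng = para.Range
        # ListString is the auto-generated label ("1.1.1", "a)", ...);
        # it is empty for paragraphs that are not part of a list.
        label = rng.ListFormat.ListString
        # Range.Text ends with the paragraph mark (a carriage return) and,
        # inside tables, possibly an end-of-cell mark (\x07); strip both.
        text = rng.Text.rstrip(u"\r\n\x07")
        yield label, text
```

Each pair can then be joined as `label + u" " + text` to get lines that look like the document on screen.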

   

Could you programmatically or manually save the Word doc as plain text and analyse the
text file? That will get you the paragraph numbers as text (at least it worked
for me just now in a quick test on Word 2003) and might solve your table issue
as well.
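
A minimal sketch of the programmatic save, assuming pywin32 is installed and that `Document.SaveAs` accepts the `WdSaveFormat` values 2 (`wdFormatText`) and 7 (`wdFormatUnicodeText`). The function names `export_as_text` and `convert` are illustrative only, and Word may still show a conversion prompt depending on version and settings.

```python
import os

# Values from Word's WdSaveFormat enumeration (assumed, see lead-in).
WD_FORMAT_TEXT = 2          # plain text in the system ANSI code page
WD_FORMAT_UNICODE_TEXT = 7  # Unicode text, keeps non-cp1252 characters


def export_as_text(doc, txt_path, file_format=WD_FORMAT_TEXT):
    """Save an open Word document as text and return the absolute path used."""
    # Word may resolve relative paths against its own default folder,
    # so hand it an absolute path.
    txt_path = os.path.abspath(txt_path)
    doc.SaveAs(FileName=txt_path, FileFormat=file_format)
    return txt_path


def convert(doc_path, txt_path):
    """Open doc_path in Word, export it as text, then close Word again."""
    import win32com.client  # pywin32; only available on Windows
    app = win32com.client.Dispatch('Word.Application')
    try:
        doc = app.Documents.Open(os.path.abspath(doc_path))
        try:
            return export_as_text(doc, txt_path)
        finally:
            doc.Close(False)  # False: do not save further changes
    finally:
        app.Quit()
```

Passing `WD_FORMAT_UNICODE_TEXT` instead of the default should avoid losing characters that have no cp1252 equivalent.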

   

Tom Hawkins
Principal Scientist
Innospec Inc
Tel: +44 (0)151 356 6197
Fax: +44 (0)151 356 6112

   

-----Original Message-----
From: python-win32-bounces+tom.hawkins=innospecinc.com at python.org [mailto:python-win32-bounces+tom.hawkins=innospecinc.com at python.org] On Behalf Of python-win32-request at python.org
Sent: 15 June 2011 02:21
To: python-win32 at python.org
Subject: python-win32 Digest, Vol 99, Issue 13

   


Today's Topics:

   1. UnicodeEncodingError when print a doc file (cool_go_blue)
   2. Re: UnicodeEncodingError when print a doc file (Tim Roberts)
   3. Re: py2.7 and multiple DDE servers on Win32, possible? (RayS)
   4. Re: UnicodeEncodingError when print a doc file (cool_go_blue)
   5. Re: UnicodeEncodingError when print a doc file (Tim Roberts)
   6. Re: UnicodeEncodingError when print a doc file (cool_go_blue)

   

   

---------------------------------------------------------------------- 

   

Message: 1
Date: Tue, 14 Jun 2011 09:30:20 -0700 (PDT)
From: cool_go_blue <cool_go_blue at yahoo.com>
To: python-win32 at python.org
Subject: [python-win32] UnicodeEncodingError when print a doc file
Message-ID: <244357.87334.qm at web43142.mail.sp1.yahoo.com>
Content-Type: text/plain; charset="iso-8859-1"

   

I try to read a word document as follows:

app = win32com.client.Dispatch('Word.Application')
doc = app.Documents.Open('D:\myfile.doc')
print doc.Content.Text

I receive the following error:

Traceback (most recent call last):
  File "D:\projects\Myself\MySVD\src\ReadWord.py", line 11, in <module>
    print doc.Content.Text
  File "D:\Softwares\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\uf06d' in position 4397: character maps to <undefined>

How can I fix the problem? Thanks.

   

   

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-win32/attachments/20110614/224c6fc5/attachment-0001.html>

   

------------------------------ 

   

Message: 2
Date: Tue, 14 Jun 2011 10:36:01 -0700
From: Tim Roberts <timr at probo.com>
To: Python-Win32 List <python-win32 at python.org>
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
Message-ID: <4DF79C01.5050703 at probo.com>
Content-Type: text/plain; charset="ISO-8859-1"

   

cool_go_blue wrote:
> I try to read a word document as follows:
>
> app = win32com.client.Dispatch('Word.Application')
> doc = app.Documents.Open('D:\myfile.doc')
> print doc.Content.Text
>
> I receive the following error:
>
> Traceback (most recent call last):
>   File "D:\projects\Myself\MySVD\src\ReadWord.py", line 11, in <module>
>     print doc.Content.Text
>   File "D:\Softwares\Python27\lib\encodings\cp1252.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_table)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\uf06d'
> in position 4397: character maps to <undefined>

   

You are reading the Word document just fine.  The issue is printing it
to your terminal.  The document contains Unicode characters that can't be
represented in your terminal's encoding (cp1252).  You need to tell it how
to handle the conversion from Unicode to 8-bit.  Try this:

    print doc.Content.Text.encode('cp1252','replace')

That will print ? wherever a character has no cp1252 equivalent.

U+F06D is not a standard character.  It's in the "private use" area, so
it's possible this is some special code to Word.
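
The two points above can be tried without Word at all. This is a small sketch using a made-up sample string; the helper names `for_console` and `private_use_chars` are illustrative, and the check relies on the standard `unicodedata` general category `Co` for private-use code points.

```python
import unicodedata


def for_console(text, encoding='cp1252'):
    """Encode text for an 8-bit console, writing '?' for unmappable characters."""
    return text.encode(encoding, 'replace')


def private_use_chars(text):
    """Return the sorted distinct private-use characters found in text."""
    # General category 'Co' marks private-use code points such as U+F06D.
    return sorted(set(ch for ch in text if unicodedata.category(ch) == 'Co'))


sample = u'5 \uf06dm'  # hypothetical fragment containing the offending character
print(repr(for_console(sample)))
print([hex(ord(ch)) for ch in private_use_chars(sample)])
```

Listing the private-use characters first can help decide whether replacing them with `?` loses anything that matters.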

   

-- 
Tim Roberts, timr at probo.com
Providenza & Boekelheide, Inc.

   

   

   

------------------------------ 

   

Message: 3
Date: Tue, 14 Jun 2011 11:34:25 -0700
From: RayS <rays at blue-cove.com>
To: <python-win32 at python.org>
Subject: Re: [python-win32] py2.7 and multiple DDE servers on Win32, possible?
Message-ID: <20110614183431.RSZK17030.fed1rmfepo101.cox.net at fed1rmimpo01.cox.net>
Content-Type: text/plain; charset="us-ascii"; Format="flowed"

   

Ahh, this
http://www.codeproject.com/KB/MFC/MFCinVisualStudioExpress.aspx
might help.
I'll work through it later...
I do have VS6, but I'm assuming that I really need 2008 (?).

Thanks,
Ray

At 10:13 PM 6/13/2011, Roger Upole wrote:
>Compiling with VC 2008 Express is going to be a problem.  The free
>compiler doesn't seem to support using the atl/mfc headers and libraries.
>
>         Roger


   

------------------------------ 

   

Message: 4
Date: Tue, 14 Jun 2011 17:49:25 -0700 (PDT)
From: cool_go_blue <cool_go_blue at yahoo.com>
To: python-win32 at python.org
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
Message-ID: <521123.44108.qm at web43131.mail.sp1.yahoo.com>
Content-Type: text/plain; charset="iso-8859-1"

   

Thanks. It works. Actually, what I want to do is to parse the whole document. How can I retrieve
the list of words in the document? I use the following code:

for word in doc.Content.Text.encode("cp1252", "replace"):
    print word

It seems that each word is a single character. How can I find an API to process words in an open Word
document? Thanks.

   

   

--- On Tue, 6/14/11, Preston Landers <planders at gmail.com> wrote:

From: Preston Landers <planders at gmail.com>
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
To: "cool_go_blue" <cool_go_blue at yahoo.com>
Date: Tuesday, June 14, 2011, 12:37 PM

The document contains Unicode content that can't be represented in cp1252
(Windows-1252), the encoding used by your console when you use the print statement.

You can always write the content to a file in UTF-8 or UTF-16 and then view the file in a
program like notepad that can handle Unicode. I'm not sure if there's any way
to get the Windows console to produce actual Unicode.
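
Writing to a file could look roughly like the sketch below. The function name `dump_text` and the sample text are assumptions for illustration; `utf-8-sig` is chosen because it writes a byte order mark that Notepad uses to recognise UTF-8.

```python
import io


def dump_text(text, path, encoding='utf-8-sig'):
    """Write Unicode text to a file that Notepad can open."""
    # Text taken from Word separates paragraphs with a bare carriage
    # return; convert it so editors show one paragraph per line.
    text = text.replace(u'\r\n', u'\n').replace(u'\r', u'\n')
    with io.open(path, 'w', encoding=encoding) as handle:
        handle.write(text)
```

With a live document this would be called as `dump_text(doc.Content.Text, 'myfile.txt')`, with no `encode` step needed beforehand.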

   

   

If you absolutely must print this in the console, you can always substitute out unknown
characters.

print doc.Content.Text.encode("cp1252", "replace")

Hope this helps,
Preston


   

_______________________________________________
python-win32 mailing list
python-win32 at python.org
http://mail.python.org/mailman/listinfo/python-win32

   

   

   

   


   

------------------------------ 

   

Message: 5
Date: Tue, 14 Jun 2011 18:02:06 -0700
From: Tim Roberts <timr at probo.com>
To: "python-win32 at python.org" <python-win32 at python.org>
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
Message-ID: <4DF8048E.6020400 at probo.com>
Content-Type: text/plain; charset="ISO-8859-1"

   

cool_go_blue wrote:
> Thanks. It works. Actually, what I want to do is to parse the whole
> document. How can I retrieve the list of words in the
> document? I use the following code:
>
> for word in doc.Content.Text.encode("cp1252", "replace"):
>     print word
>
> It seems that each word is a single character.

>   

   

No, what you are getting back is a Python string.  When you enumerate
through a string, you get characters.  This is basic Python.

If your words are all separated by spaces, you can use split:

    for word in doc.Content.Text.encode("cp1252","replace").split():
        print word

Note, however, that you don't need to convert it to an 8-bit character
set until you want to print it.  If you are going to process these
words, then you might as well leave them in Unicode.
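
Splitting first and encoding last might look like this sketch. The sample string is invented, the name `words` is illustrative, and the handling of `\x07` rests on the assumption that Word marks the end of a table cell with that character in `Range.Text`.

```python
def words(text):
    """Split text from Word into whitespace-separated words, kept as Unicode."""
    # str.split() already treats Word's paragraph mark (\r) as whitespace,
    # but not the assumed end-of-cell mark (\x07), so blank that out first.
    return text.replace(u'\x07', u' ').split()


content = u'1 Scope\rFirst cell\r\x07Second \uf06d cell\r\x07'  # made-up sample
for word in words(content):
    # Convert to the console's 8-bit character set only when printing.
    print(word.encode('cp1252', 'replace').decode('cp1252'))
```

Word also exposes a `Words` collection on documents and ranges, which may be worth a look if punctuation needs to be treated separately.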

   

-- 
Tim Roberts, timr at probo.com
Providenza & Boekelheide, Inc.

   

   

   

------------------------------ 

   

Message: 6
Date: Tue, 14 Jun 2011 18:20:59 -0700 (PDT)
From: cool_go_blue <cool_go_blue at yahoo.com>
To: "python-win32 at python.org" <python-win32 at python.org>
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
Message-ID: <731230.99185.qm at web43140.mail.sp1.yahoo.com>
Content-Type: text/plain; charset="iso-8859-1"

   

Thanks. I just found that all item numbers such as 1.1.1 are gone. How can I get these numbers?
Also, if all items are in a table, how can I get the contents of all items and
ignore the table structure? Thanks.

   


   

------------------------------ 

   


   

   

End of python-win32 Digest, Vol 99, Issue 13

******************************************** 



 




_______________________________________________
python-win32 mailing list
python-win32 at python.org
http://mail.python.org/mailman/listinfo/python-win32
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-win32/attachments/20110615/25fbccb6/attachment-0001.html>


More information about the python-win32 mailing list