[python-win32] UnicodeEncodingError when print a doc file

Tom Hawkins Tom.Hawkins at innospecinc.com
Wed Jun 15 13:57:58 CEST 2011


>Thanks. I just find that all item numbers such as 1.1.1 are gone. How
>can I get these numbers. Also, If all items are in a table, how can I
>get the contents of all items and ignore the table structure. Thanks.

 

If you have auto-numbered paragraphs then I'd guess the numbers are
generated by Word as part of formatting the text for display and aren't
part of the literal text of the document.

 

Could you programmatically or manually save the Word doc as plain text
and analyse the text file? That will get you the paragraph numbers as
text (at least it worked for me just now in a quick test on Word 2003)
and might solve your table issue as well.
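
For what it's worth, here is a rough sketch of the "save as plain text, then analyse the text" idea. It is untested against a real document and makes a few assumptions: save_as_text(), numbered_items() and ITEM_RE are names I made up for illustration, 2 is the value of Word's wdFormatText constant, and the regular expression is only a guess at what a numbered line looks like once Word has written the numbers out as text (it will also match any ordinary line that happens to start with a number).

```python
import re

WD_FORMAT_TEXT = 2  # value of Word's wdFormatText constant


def save_as_text(doc_path, txt_path):
    """Ask Word (via COM) to save doc_path as plain text at txt_path."""
    import win32com.client  # Windows only; needs pywin32 and Word installed

    app = win32com.client.Dispatch('Word.Application')
    try:
        doc = app.Documents.Open(doc_path)
        doc.SaveAs(txt_path, WD_FORMAT_TEXT)
        doc.Close(False)  # close without saving further changes
    finally:
        app.Quit()


# Hypothetical pattern: a dotted item number such as "1.1.1" at line start.
ITEM_RE = re.compile(r'^\s*(\d+(?:\.\d+)*)\.?\s+(.*)$')


def numbered_items(text):
    """Return (number, rest_of_line) pairs for lines that start with a number."""
    items = []
    for line in text.splitlines():
        match = ITEM_RE.match(line)
        if match:
            items.append((match.group(1), match.group(2)))
    return items
```

You would call save_as_text() with full paths, read the resulting file back in, and pass its contents to numbered_items(). The encoding of the saved file depends on Word's settings, so check that before decoding it.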

 

Tom Hawkins

Principal Scientist

Innospec Inc

Tel: +44 (0)151 356 6197

Fax: +44 (0)151 356 6112

 

-----Original Message-----
From: python-win32-bounces+tom.hawkins=innospecinc.com at python.org
[mailto:python-win32-bounces+tom.hawkins=innospecinc.com at python.org] On
Behalf Of python-win32-request at python.org
Sent: 15 June 2011 02:21
To: python-win32 at python.org
Subject: python-win32 Digest, Vol 99, Issue 13

 


Today's Topics:

   1. UnicodeEncodingError when print a doc file (cool_go_blue)
   2. Re: UnicodeEncodingError when print a doc file (Tim Roberts)
   3. Re: py2.7 and multiple DDE servers on Win32, possible? (RayS)
   4. Re: UnicodeEncodingError when print a doc file (cool_go_blue)
   5. Re: UnicodeEncodingError when print a doc file (Tim Roberts)
   6. Re: UnicodeEncodingError when print a doc file (cool_go_blue)

 

 

----------------------------------------------------------------------

 

Message: 1
Date: Tue, 14 Jun 2011 09:30:20 -0700 (PDT)
From: cool_go_blue <cool_go_blue at yahoo.com>
To: python-win32 at python.org
Subject: [python-win32] UnicodeEncodingError when print a doc file
Message-ID: <244357.87334.qm at web43142.mail.sp1.yahoo.com>
Content-Type: text/plain; charset="iso-8859-1"

I try to read a Word document as follows:

    import win32com.client

    app = win32com.client.Dispatch('Word.Application')
    # raw string, so the backslash in the path is taken literally
    doc = app.Documents.Open(r'D:\myfile.doc')
    print doc.Content.Text

I receive the following error:

Traceback (most recent call last):
  File "D:\projects\Myself\MySVD\src\ReadWord.py", line 11, in <module>
    print doc.Content.Text
  File "D:\Softwares\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\uf06d' in
position 4397: character maps to <undefined>

How can I fix the problem? Thanks.

 

 


 

------------------------------

 

Message: 2
Date: Tue, 14 Jun 2011 10:36:01 -0700
From: Tim Roberts <timr at probo.com>
To: Python-Win32 List <python-win32 at python.org>
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
Message-ID: <4DF79C01.5050703 at probo.com>
Content-Type: text/plain; charset="ISO-8859-1"

cool_go_blue wrote:
> I try to read a word document as follows:
>
> app = win32com.client.Dispatch('Word.Application')
> doc = app.Documents.Open('D:\myfile.doc')
> print doc.Content.Text
>
> I receive the following error:
>
> Traceback (most recent call last):
>   File "D:\projects\Myself\MySVD\src\ReadWord.py", line 11, in <module>
>     print doc.Content.Text
>   File "D:\Softwares\Python27\lib\encodings\cp1252.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_table)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\uf06d'
> in position 4397: character maps to <undefined>
>

 

You are reading the Word document just fine.  The issue is printing it
to your terminal.  The document contains Unicode characters that can't
be represented in your terminal's encoding (cp1252).  You need to tell
Python how to handle the conversion from Unicode to 8-bit.  Try this:

    print doc.Content.Text.encode('cp1252','replace')

That will print ? where characters can't be encoded.

U+F06D is not an assigned character.  It's in the "private use" area, so
it's possible this is some special code to Word.
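
To make the point above concrete, here is a small sketch of what the different error handlers do with U+F06D. The sample string is made up and merely stands in for doc.Content.Text; nothing here touches Word. It also checks the character's Unicode category, which is 'Co' for private-use code points.

```python
import unicodedata

sample = u'pore size 5 \uf06dm'  # made-up text containing U+F06D

# 'Co' is the Unicode general category for private-use code points.
category = unicodedata.category(u'\uf06d')

# The default handler ('strict') raises; this is what print ran into.
try:
    sample.encode('cp1252')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Three ways of getting past the character that cp1252 cannot represent.
replaced = sample.encode('cp1252', 'replace')          # '?' substituted
ignored = sample.encode('cp1252', 'ignore')            # character dropped
escaped = sample.encode('cp1252', 'backslashreplace')  # \uf06d spelled out
```

'replace' is the least surprising choice for console output; 'backslashreplace' is handy when you want to see exactly which code points were in the way.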

 

-- 

Tim Roberts, timr at probo.com

Providenza & Boekelheide, Inc.

 

 

 

------------------------------

 

Message: 3
Date: Tue, 14 Jun 2011 11:34:25 -0700
From: RayS <rays at blue-cove.com>
To: <python-win32 at python.org>
Subject: Re: [python-win32] py2.7 and multiple DDE servers on Win32,
          possible?
Message-ID: <20110614183431.RSZK17030.fed1rmfepo101.cox.net at fed1rmimpo01.cox.net>
Content-Type: text/plain; charset="us-ascii"; Format="flowed"

Ahh, this
http://www.codeproject.com/KB/MFC/MFCinVisualStudioExpress.aspx
might help.
I'll work through it later...
I do have VS6, but I'm assuming that I really need 2008 (?).

Thanks,
Ray

At 10:13 PM 6/13/2011, Roger Upole wrote:
>Compiling with VC 2008 Express is going to be a problem.  The free
>compiler doesn't seem to support using the atl/mfc headers and
>libraries.
>
>          Roger


 

------------------------------

 

Message: 4
Date: Tue, 14 Jun 2011 17:49:25 -0700 (PDT)
From: cool_go_blue <cool_go_blue at yahoo.com>
To: python-win32 at python.org
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
Message-ID: <521123.44108.qm at web43131.mail.sp1.yahoo.com>
Content-Type: text/plain; charset="iso-8859-1"

Thanks. It works. Actually, what I want to do is parse the whole
document. How can I retrieve the list of words in the document? I use
the following code:

    for word in doc.Content.Text.encode("cp1252", "replace"):
        print word

It seems that each word is a single character. How can I find an API to
process the words in an open Word document? Thanks.

 

 

--- On Tue, 6/14/11, Preston Landers <planders at gmail.com> wrote:

From: Preston Landers <planders at gmail.com>
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
To: "cool_go_blue" <cool_go_blue at yahoo.com>
Date: Tuesday, June 14, 2011, 12:37 PM

The document contains Unicode content that can't be rendered directly in
the cp1252 (Windows-1252) encoding used by your console when you use the
print statement.

You can always write the content to a file in UTF-8 or UTF-16 and then
view the file in a program like Notepad that can handle Unicode. I'm not
sure if there's any way to get the Windows console to produce actual
Unicode.
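
A minimal sketch of the write-to-a-file route, in case it is useful. The function name write_utf8, the sample text and the output file name are all made up for illustration; in real use the text would come from doc.Content.Text. Two assumptions are baked in: the 'utf-8-sig' codec is used so the file starts with a byte order mark that Notepad recognises, and Word's '\r' paragraph marks are turned into ordinary newlines first.

```python
import io
import os
import tempfile


def write_utf8(text, path):
    """Write a unicode string to path as UTF-8 with a leading BOM."""
    # Word marks paragraph ends with '\r'; convert them to newlines.
    text = text.replace(u'\r', u'\n')
    with io.open(path, 'w', encoding='utf-8-sig') as handle:
        handle.write(text)


sample = u'pore size 5 \uf06dm\r'  # stand-in for doc.Content.Text
out_path = os.path.join(tempfile.gettempdir(), 'word_text_demo.txt')
write_utf8(sample, out_path)

# Read it back to confirm nothing was lost on the way.
with io.open(out_path, encoding='utf-8-sig') as handle:
    round_trip = handle.read()
```

Unlike encoding to cp1252 with 'replace', this keeps every character, including the private-use one, so nothing is turned into a question mark.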

 

 

If you absolutely must print this in the console, you can always
substitute out unknown characters.

    print doc.Content.Text.encode("cp1252", "replace")

Hope this helps,
Preston


 

------------------------------

 

Message: 5
Date: Tue, 14 Jun 2011 18:02:06 -0700
From: Tim Roberts <timr at probo.com>
To: "python-win32 at python.org" <python-win32 at python.org>
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
Message-ID: <4DF8048E.6020400 at probo.com>
Content-Type: text/plain; charset="ISO-8859-1"

cool_go_blue wrote:
> Thanks. It works. Actually, what I want to do is to parse the whole
> document. How can I retrieve the list of words in the
> document? I use the following code:
>
> for word in doc.Content.Text.encode("cp1252", "replace"):
>     print word
>
> It seems that word is each a character.
>

No, what you are getting back is a Python string.  When you iterate
over a string, you get characters.  This is basic Python.

If your words are all separated by whitespace, you can use split:

    for word in doc.Content.Text.encode("cp1252","replace").split():
        print word

Note, however, that you don't need to convert it to an 8-bit character
set until you want to print it.  If you are going to process these
words, then you might as well leave them in Unicode.
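
As a sketch of "leave them in Unicode until you print", something along these lines should do. The sample string is invented and stands in for doc.Content.Text; it assumes, as Word does, that paragraphs end in '\r', which split() treats as whitespace.

```python
# Stand-in for doc.Content.Text; Word uses '\r' as its paragraph mark.
text = u'1.1 Filter size\r5 \uf06dm pore\r'

# split() with no arguments splits on any run of whitespace, including '\r'.
words = text.split()

# Do the processing on the unicode strings themselves...
lengths = [len(word) for word in words]

# ...and only convert to 8-bit at the point of console output.
printable = [word.encode('cp1252', 'replace') for word in words]
```

Word's object model also has a Words collection on the document (doc.Words), which may be closer to what was asked for, but I have not tried it here and it may not split text the same way as whitespace splitting.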

 

-- 

Tim Roberts, timr at probo.com

Providenza & Boekelheide, Inc.

 

 

 

------------------------------

 

Message: 6
Date: Tue, 14 Jun 2011 18:20:59 -0700 (PDT)
From: cool_go_blue <cool_go_blue at yahoo.com>
To: "python-win32 at python.org" <python-win32 at python.org>
Subject: Re: [python-win32] UnicodeEncodingError when print a doc file
Message-ID: <731230.99185.qm at web43140.mail.sp1.yahoo.com>
Content-Type: text/plain; charset="iso-8859-1"

Thanks. I just found that all item numbers such as 1.1.1 are gone. How
can I get these numbers? Also, if all items are in a table, how can I
get the contents of all items and ignore the table structure? Thanks.

 


 

------------------------------

 

_______________________________________________

python-win32 mailing list

python-win32 at python.org

http://mail.python.org/mailman/listinfo/python-win32

 

 

End of python-win32 Digest, Vol 99, Issue 13

********************************************
