[XML-SIG] Advice needed: RTF->XML conversions

Jeremy J. Sydik jsydik@virtualparadigm.com
Thu, 17 May 2001 18:14:30 -0500

Martin is right.  The Office/Word 'XML' can be a difficult thing to work
with.  It's been a while since I've thought about it, but you will probably
need to account for the following:

	* Not all attribute values are quoted.
	* Singleton tags aren't closed.  (This can be dealt with fairly
		easily, however: only the 'standard' singleton HTML tags
		occur this way (br, img, etc.).)
	* There are a few Microsoft namespaces to deal with, as well as
		special tags.  The documentation for these is found in:
		The primary ones you'll probably encounter are o: and w:
	* Also described in this document are
			<!--[if condition]>...<![endif]-->
			<![if condition]>...<![endif]>
		pairs.  These break most SGML and XML implementations.
		(It would be good to think of a regex solution, since
		you'll probably need one to properly quote the attribute
		values anyway.)
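
As a rough illustration of the regex approach for the list above, here is a
minimal Python sketch.  It is not a tested converter: the function name
tidy_word_markup, the VOID_TAGS list and the decision to drop the hidden
<!--[if ...]> blocks entirely are all my own assumptions, and it ignores
cases such as valueless attributes (<td nowrap>), '>' inside quoted values,
and unescaped '&' characters.

```python
import re

# Tags Word leaves unclosed (an assumed list; extend it for your documents).
VOID_TAGS = ("br", "img", "hr", "meta", "link", "input")

# <!--[if ...]> ... <![endif]--> blocks; this sketch drops them whole.
COMMENT_CONDITIONAL = re.compile(
    r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.IGNORECASE | re.DOTALL)
# Bare <![if ...]> and <![endif]> markers; only the markers are dropped.
BARE_CONDITIONAL = re.compile(r"<!\[(?:if[^\]]*|endif)\]>", re.IGNORECASE)

# A start tag: name, optional attribute text, optional trailing slash.
START_TAG = re.compile(r"<([A-Za-z][\w:.-]*)(\s[^<>]*?)?(/?)>")
# One attribute: double-quoted, single-quoted, or unquoted value.
ATTR = re.compile(r"""([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""")


def _quote_attr(match):
    """Rewrite one attribute so its value is always double-quoted."""
    name = match.group(1)
    value = next(g for g in match.group(2, 3, 4) if g is not None)
    return '%s="%s"' % (name, value.replace('"', "&quot;"))


def _fix_tag(match):
    """Quote every attribute in a start tag and self-close void tags."""
    name, attrs, slash = match.group(1), match.group(2) or "", match.group(3)
    attrs = ATTR.sub(_quote_attr, attrs)
    if name.lower() in VOID_TAGS:
        slash = "/"
    return "<%s%s%s>" % (name, attrs, slash)


def tidy_word_markup(text):
    """Apply the three clean-ups in order: conditionals, then tags."""
    text = COMMENT_CONDITIONAL.sub("", text)
    text = BARE_CONDITIONAL.sub("", text)
    return START_TAG.sub(_fix_tag, text)
```

The quoted alternatives come first in ATTR so that an '=' inside an already
quoted value is consumed with that value and not mistaken for a new
attribute.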

Once those issues are addressed, you SHOULD have valid XML.  If you don't,
chances are you haven't hit everything in this list :)
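
One quick way to find out whether everything has been hit is to hand the
cleaned text to an XML parser and look at the error it reports.  The sketch
below uses xml.etree.ElementTree from the standard library; the helper name
well_formedness_error is made up for illustration.  Note that a parser like
this only checks well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(text):
    """Return None if text parses as XML, otherwise the parser's message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        # The message includes the line and column of the first problem.
        return str(exc)
    return None


if __name__ == "__main__":
    print(well_formedness_error('<p class="a">x<br/></p>'))
    print(well_formedness_error('<p class=a>x<br></p>'))
```

The parser is namespace-aware, so I would expect the o: and w: prefixes to
be reported as errors unless the matching xmlns declarations are still
present on an enclosing element.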

	Good Luck,
-----Original Message-----
From: xml-sig-admin@python.org [mailto:xml-sig-admin@python.org]On
Behalf Of Martin v. Loewis
Sent: Thursday, May 17, 2001 1:15 PM
To: Mike.Olson@fourthought.com
Cc: tony.mcdonald@ncl.ac.uk; Alexandre.Fayolle@logilab.fr;
Subject: Re: [XML-SIG] Advice needed: RTF->XML conversions

> Can you send me a sample of the Word XML output, and the format you're
> looking for?  You can probably do it with a stylesheet as long as what
> Word spits out really is XML.

It isn't. Most notably, attribute values are not enclosed in quotes.
I found that sgmlop can parse what Word produces, though.


XML-SIG maillist  -  XML-SIG@python.org