[Expat-discuss] & symbol workaround
Brad Causey
bradcausey at gmail.com
Wed Feb 4 22:50:39 CET 2009
All,
It sounds like the consensus is that I need to modify the incoming malformed
XML before parsing it. This is my solution, and it worked for what I needed:
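# 'i' is the path of the current input file (the loop variable in my script)
fileo = open(i, 'r')
unfixml = fileo.read()
fileo.close()
# Replace every '&' with a space so expat accepts the file.
# This is lossy: it also mangles valid references such as '&amp;'.
fixml = unfixml.replace('&', ' ')
# Write the cleaned copy so the original file stays untouched
file = open('buffer.xml', 'w')
file.write(fixml)
file.close()
# Reopen the cleaned copy for the parser
file = open('buffer.xml', 'r')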
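For anyone who needs to keep the ampersands rather than drop them, here is a rough sketch of a less lossy variant. It escapes only the bare '&' characters and parses the result in memory, so no buffer file is needed. The names escape_stray_ampersands and parse_lenient are made up for illustration, and using xml.dom.minidom is an assumption; any expat-based parser should behave the same way. It will still mangle a literal '&' inside a CDATA section or comment, and an undeclared named entity such as '&foo;' will still be rejected.

```python
import re
import xml.dom.minidom

# Match '&' unless it already starts an entity or character reference
# such as &amp;, &#38; or &#x26;.
_STRAY_AMP = re.compile(r'&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)')


def escape_stray_ampersands(text):
    """Return text with every bare '&' rewritten as '&amp;'."""
    return _STRAY_AMP.sub('&amp;', text)


def parse_lenient(path):
    """Hypothetical helper: read a file, escape bare '&', parse in memory."""
    with open(path, 'r') as handle:
        raw = handle.read()
    return xml.dom.minidom.parseString(escape_stray_ampersands(raw))
```

The original files are only ever opened for reading, so they stay in their original format.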
Hopefully this helps some other poor lad who has crappy XML.
Thanks to all for the input!
-Brad
Rolf Ade wrote:
> Brad Causey wrote:
>
>> I completely agree. Unfortunately, I don't have control over the code that
>> generates these XML files.
>> If there isn't a better alternative, I'll have to create a duplicate of
>> EVERY file and parse each one at a text level to replace non-standard
>> characters with an escaped version. (doing this for < is nearly impossible)
>> This is something I am trying to avoid for obvious reasons. I don't like
>> non-standard XML any more than the next guy. (I've been through 3 different
>> python XML parsers trying to resolve this) But I'm running out of options.
>> Any ideas?
>>
>
> This is not the world of network protocols. The markup world is very
> strict about syntax. An entity is either a well-formed XML document
> or it is not, no fuss, even no doubt (believe it or not: at least at
> this basic level all major parsers out there agree, even in bizarre
> cases), no Robustness Principle.
>
> Something with a single (not escaped) ampersand in it isn't an XML
> document. Period.
>
> Even worse for you: I don't know of any parser that would let that pass.
>
> Raise the problem with whoever produces your input data, just so that
> you've done it.
>
> If 'they' force you to handle the problem, I'm afraid there is no
> other way, if you want to use an XML parser, than to modify your input
> data with a preprocessing step, either on a copy or, if the sizes are
> small, in memory.
>
> I'm sorry I don't have better news.
> rolf
>
>
>
>>
>> -Brad
>>
>>
>> On Wed, Feb 4, 2009 at 2:30 PM, Nick <nickmacd at xxx.com> wrote:
>>
>>
>>> & is NOT valid as a standalone character in XML and needs to be
>>> escaped as &amp; otherwise you are not parsing standard (and thus
>>> valid) XML files, but in fact parsing some other hybrid thing.
>>>
>>> Referring to the XML standard ( http://www.w3.org/TR/REC-xml/ ):
>>>
>>> The ampersand character (&) and the left angle bracket (<) MUST NOT
>>> appear in their literal form, except when used as markup delimiters,
>>> or within a comment, a processing instruction, or a CDATA section. If
>>> they are needed elsewhere, they MUST be escaped using either numeric
>>> character references or the strings "&amp;" and "&lt;"
>>> respectively. The right angle bracket (>) may be represented using the
>>> string "&gt;", and MUST, for compatibility, be escaped using either
>>> "&gt;" or a character reference when it appears in the string "]]>"
>>> in content, when that string is not marking the end of a CDATA
>>> section.
>>>
>>> So I would argue that you NEED to change the source files, in order to
>>> bring them into line with the standard.
>>>
>>> Nick
>>>
>>>
>>> On Wed, Feb 4, 2009 at 2:56 PM, Brad Causey <bradcausey at gmail.com> wrote:
>>>
>>>> I am working on a Python script that parses around 6800 small xml files.
>>>> My code isn't pretty, as I am just testing a PoC at this point, but I have
>>>> run into a problem. When the script hits the Ampersand symbol, it quits with
>>>> "xml.parsers.expat.ExpatError: not well-formed (invalid token): line 28,
>>>> column 41"
>>>>
>>>> I am trying to figure out a way to work around this without modifying the
>>>> XML files themselves as these need to be preserved in the original format.
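The error quoted above is easy to reproduce, which helps when checking which of the 6800 files are affected before touching any of them. The following is only a sketch: find_xml_error and the sample string are invented for illustration, and xml.dom.minidom is assumed as the front end to expat. ExpatError carries the position in its lineno and offset attributes.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def find_xml_error(text):
    """Return (lineno, offset, message) for the first error, or None if well-formed."""
    try:
        xml.dom.minidom.parseString(text)
    except ExpatError as err:
        # lineno and offset locate the token expat rejected
        return err.lineno, err.offset, str(err)
    return None


# Made-up input with a bare ampersand on its second line
sample = "<shop>\n  <name>Fish & Chips</name>\n</shop>"
print(find_xml_error(sample))
```

Running this over each file without modifying it gives a list of the documents that would need preprocessing.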
>>>