[Expat-discuss] & symbol workaround

Wed Feb 4 22:50:39 CET 2009

All,

It sounds like the consensus is that I need to modify the incoming malformed
XML. This is my solution, which replaces every & with a space before parsing;
it worked for what I needed:

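    # i is presumably the path of the current input file (loop not shown)
    fileo = open(i, 'r')
    unfixml = fileo.read()
    fileo.close()
    # Blank out every ampersand; str.replace needs no "import string".
    # Note that this also alters valid references such as &amp;
    fixml = unfixml.replace('&', ' ')
    file = open('buffer.xml', 'w')
    file.write(fixml)
    file.close()  # close() flushes, so a separate flush() is not needed
    # Reopen the cleaned copy so it can be handed to the parser
    file = open('buffer.xml', 'r')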
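
A possible refinement, sketched here as an assumption rather than what was
actually run: instead of blanking every &, escape only those that do not
already begin an entity or character reference, and parse the result in
memory so the original files stay untouched. The names BARE_AMP and
escape_bare_ampersands and the sample string are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper names; not part of the original script.
# Matches "&" unless it already starts a named entity (&amp;),
# a decimal reference (&#38;) or a hex reference (&#x26;).
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)')

def escape_bare_ampersands(text):
    """Return text with every stray "&" rewritten as "&amp;"."""
    return BARE_AMP.sub('&amp;', text)

# Made-up sample standing in for one of the malformed files.
sample = '<item>Tom & Jerry &amp; friends &#38; co</item>'
fixed = escape_bare_ampersands(sample)

# Parse from the in-memory string; no buffer file is written.
root = ET.fromstring(fixed)
print(root.text)
```

This only addresses stray ampersands. A named entity that is never declared
(for example &nbsp;) would still be rejected by the parser, and a stray < is
not handled at all.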

Hopefully this helps some other poor lad who has crappy XML.

Thanks to all for the input!

-Brad

Rolf Ade wrote:
> Brad Causey wrote:
>   
>> I completely agree. Unfortunately, I don't have control over the code that
>> generates these XML files.
>> If there isn't a better alternative, I'll have to create a duplicate of
>> EVERY file and parse each one at a text level to replace non-standard
>> characters with an escaped version. (doing this for < is nearly impossible)
>> This is something I am trying to avoid for obvious reasons. I don't like
>> non-standard XML any more than the next guy. (I've been through 3 different
>> python XML parsers trying to resolve this) But I'm running out of options.
>> Any ideas?
>>     
>
> This is not the world of network protocols. The markup world is very
> strict about syntax. An entity is either a well-formed XML document
> or it is not, no fuss, even no doubt (believe it or not: at least at
> this basic level all major parsers out there agree, even in bizarre
> cases), no Robustness Principle.
>
> Something with a single (not escaped) ampersand in it isn't an XML
> document. Period.
>
> Even worse for you: I don't know of any parser that would let that pass.
>
> Raise the problem with whoever produces your input data, if only so that
> you have done it.
>
> If 'they' force you to handle the problem, I'm afraid there is no
> other way than to modify your input data, with a preprocessing step on
> a copy or, if the sizes are small, in memory, if you want to use an
> XML parser.
>
> I'm sorry I don't have better news.
> rolf
>
>
>   
>>
>> -Brad
>>
>>
>> On Wed, Feb 4, 2009 at 2:30 PM, Nick <nickmacd at xxx.com> wrote:
>>
>>     
>>> A bare & is NOT valid as a standalone character in XML and needs to be
>>> escaped as &amp; otherwise you are not parsing standard (and thus
>>> valid) XML files, but in fact parsing some other hybrid thing.
>>>
>>> Referring to the XML standard ( http://www.w3.org/TR/REC-xml/ ):
>>>
>>> The ampersand character (&) and the left angle bracket (<) MUST NOT
>>> appear in their literal form, except when used as markup delimiters,
>>> or within a comment, a processing instruction, or a CDATA section. If
>>> they are needed elsewhere, they MUST be escaped using either numeric
>>> character references or the strings "&amp;" and "&lt;"
>>> respectively. The right angle bracket (>) may be represented using the
>>> string "&gt;", and MUST, for compatibility, be escaped using either
>>> "&gt;" or a character reference when it appears in the string "]]>"
>>> in content, when that string is not marking the end of a CDATA
>>> section.
>>>
>>> So I would argue that you NEED to change the source files, in order to
>>> bring them into line with the standard.
>>>
>>> Nick
>>>
>>>
>>> On Wed, Feb 4, 2009 at 2:56 PM, Brad Causey <bradcausey at xxx.com>
>>> wrote:
>>>       
>>>> I am working on a Python script that parses around 6800 small xml files.
>>>> My code isn't pretty, as I am just testing a PoC at this point, but I have
>>>> run into a problem. When the script hits the Ampersand symbol, it quits with
>>>> "xml.parsers.expat.ExpatError: not well-formed (invalid token): line 28,
>>>> column 41"
>>>>
>>>> I am trying to figure out a way to work around this without modifying the
>>>> XML files themselves as these need to be preserved in the original format.
>>> <NickMacD at gmail.com>
>>>       
>> _______________________________________________
>> Expat-discuss mailing list
>> Expat-discuss at libexpat.org
>> http://mail.libexpat.org/mailman/listinfo/expat-discuss