From bradcausey at gmail.com Wed Feb 4 20:56:11 2009
From: bradcausey at gmail.com (Brad Causey)
Date: Wed, 4 Feb 2009 13:56:11 -0600
Subject: [Expat-discuss] & symbol workaround
Message-ID: <89f89940902041156h179005a0m14e1059f407de0b4@mail.gmail.com>

Hi list,

I am working on a Python script that parses around 6800 small xml files.
My code isn't pretty, as I am just testing a PoC at this point, but I have
run into a problem. When the script hits the Ampersand symbol, it quits with
"xml.parsers.expat.ExpatError: not well-formed (invalid token): line 28,
column 41"

I am trying to figure out a way to work around this without modifying the
XML files themselves as these need to be preserved in the original format.

Here is my code:

import xml.parsers.expat
import string
import os

#var setup
list = []
values = []
indexy = ('RulesVersion','AuditDate','ComputerName','UserName','UserDomain','OSName','OSServicePack','OSBuild','AntiVirusProduct','ExeVersion','SigsVersion','Active','Timeout','PasswordRequired','PasswordLength','Modem','Dialtone')
out = open('test.txt','w')

#handler functions
def start_element(name, attrs):
    name = str(name)
    list.append(name)

def end_element(name):
    name = str(name)
    list.append(name)

def char_data(data):
    data = str(data)
    list.append(data)

#file parsing
xlist = os.popen (r"dir /od /a-d /b *.xml").read ().splitlines ()
for i in xlist:
    print i
    p = xml.parsers.expat.ParserCreate('ASCII')
    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data
    values.append(i)
    file = open(i,'r')
    p.ParseFile(file)
    for item in indexy:
        check = item
        try:
            item = list.index(item)
            if check == 'AntiVirusProduct':
                values.append(list[item+3])
            elif check == 'Modem':
                values.append(list[item+3])
            else:
                values.append(list[item+1])
        except:
            values.append('NOT FOUND')
    file.close()
    print values
    list =[]
    values =[]

-B

From bradcausey at gmail.com Wed Feb 4 21:40:27 2009
From: bradcausey at gmail.com (Brad Causey)
Date: Wed, 4 Feb 2009
 14:40:27 -0600
Subject: [Expat-discuss] & symbol workaround
In-Reply-To:
References: <89f89940902041156h179005a0m14e1059f407de0b4@mail.gmail.com>
Message-ID: <89f89940902041240y5964ec26x28fa12c6011234f9@mail.gmail.com>

Nick,

I completely agree. Unfortunately, I don't have control over the code that
generates these XML files.
If there isn't a better alternative, I'll have to create a duplicate of
EVERY file and parse each one at a text level to replace non-standard
characters with an escaped version. (doing this for < is nearly impossible)
This is something I am trying to avoid for obvious reasons. I don't like
non-standard XML any more than the next guy. (I've been through 3 different
python XML parsers trying to resolve this) But I'm running out of options.
Any ideas?

-Brad

On Wed, Feb 4, 2009 at 2:30 PM, Nick wrote:

> amp is NOT valid as a standalone character in XML and needs to be
> escaped as &amp; otherwise you are not parsing standard (and thus
> valid) XML files, but in fact parsing some other hybrid thing.
>
> Referring to the XML standard ( http://www.w3.org/TR/REC-xml/ ):
>
> The ampersand character (&) and the left angle bracket (<) MUST NOT
> appear in their literal form, except when used as markup delimiters,
> or within a comment, a processing instruction, or a CDATA section. If
> they are needed elsewhere, they MUST be escaped using either numeric
> character references or the strings " &amp; " and " &lt; "
> respectively. The right angle bracket (>) may be represented using the
> string " &gt; ", and MUST, for compatibility, be escaped using either
> " &gt; " or a character reference when it appears in the string " ]]>
> " in content, when that string is not marking the end of a CDATA
> section.
>
> So I would argue that you NEED to change the source files, in order to
> bring them into line with the standard.
>
> Nick

From karl at waclawek.net Wed Feb 4 22:09:15 2009
From: karl at waclawek.net (Karl Waclawek)
Date: Wed, 04 Feb 2009 16:09:15 -0500
Subject: [Expat-discuss] & symbol workaround
In-Reply-To: <89f89940902041240y5964ec26x28fa12c6011234f9@mail.gmail.com>
References: <89f89940902041156h179005a0m14e1059f407de0b4@mail.gmail.com>
 <89f89940902041240y5964ec26x28fa12c6011234f9@mail.gmail.com>
Message-ID: <498A03FB.9000101@waclawek.net>

Brad Causey wrote:
> Nick,
>
> I completely agree. Unfortunately, I don't have control over the code that
> generates these XML files.
> If there isn't a better alternative, I'll have to create a duplicate of
> EVERY file and parse each one at a text level to replace non-standard
> characters with an escaped version. (doing this for < is nearly impossible)
> This is something I am trying to avoid for obvious reasons. I don't like
> non-standard XML any more than the next guy. (I've been through 3 different
> python XML parsers trying to resolve this) But I'm running out of options.
> Any ideas?

There is no XML parser that will accept these files, as they are not
well-formed. Strictly speaking, they are not XML files at all.

You could try to fix each block as you are passing it to the Expat parser.
Not sure how the Python wrapper works, though.
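
Fixing each block on its way into Expat, as suggested above, can be sketched with the Python wrapper's incremental `Parse(data, isfinal)` call. This is an illustration under assumptions, not code from the thread: it is written for Python 3, the names `BARE_AMP`, `fixed_blocks` and `parse_blocks` are made up for the example, and it treats every `&` that does not start a syntactically complete reference as a literal ampersand. It does not special-case CDATA sections or comments, where a literal `&` is legal, and an undeclared entity such as `&nbsp;` would still be rejected by the parser.

```python
import re
import xml.parsers.expat

# "&" that is NOT followed by "name;", "#123;" or "#x1F;" (hypothetical helper).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def fixed_blocks(blocks):
    """Yield blocks with bare ampersands escaped.

    A trailing "&..." without a ";" is held back and glued onto the next
    block, so a reference split across two blocks is not escaped by mistake.
    """
    pending = ""
    for block in blocks:
        text = pending + block
        cut = text.rfind("&")
        if cut != -1 and ";" not in text[cut:]:
            text, pending = text[:cut], text[cut:]
        else:
            pending = ""
        yield BARE_AMP.sub("&amp;", text)
    # Whatever is still held back at the end cannot be completed any more.
    yield BARE_AMP.sub("&amp;", pending)


def parse_blocks(blocks):
    """Feed repaired blocks to Expat and return the collected character data."""
    pieces = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = pieces.append
    for fixed in fixed_blocks(blocks):
        if fixed:
            parser.Parse(fixed, False)
    parser.Parse("", True)  # signal end of input
    return "".join(pieces)
```

With a file, the blocks could come from something like `iter(lambda: f.read(4096), "")` on a file opened in text mode; the original file on disk is never modified.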
Karl

From rolf at pointsman.de Wed Feb 4 22:15:23 2009
From: rolf at pointsman.de (Rolf Ade)
Date: Wed, 04 Feb 2009 22:15:23 +0100
Subject: [Expat-discuss] & symbol workaround
In-Reply-To: <89f89940902041240y5964ec26x28fa12c6011234f9@mail.gmail.com>
References: <89f89940902041156h179005a0m14e1059f407de0b4@mail.gmail.com>
 <89f89940902041240y5964ec26x28fa12c6011234f9@mail.gmail.com>
Message-ID: <498A056B.808@pointsman.de>

Brad Causey wrote:
> I completely agree. Unfortunately, I don't have control over the code that
> generates these XML files.
> If there isn't a better alternative, I'll have to create a duplicate of
> EVERY file and parse each one at a text level to replace non-standard
> characters with an escaped version. (doing this for < is nearly impossible)
> This is something I am trying to avoid for obvious reasons. I don't like
> non-standard XML any more than the next guy. (I've been through 3 different
> python XML parsers trying to resolve this) But I'm running out of options.
> Any ideas?

This is not the world of network protocols. The markup world is very
strict about syntax. An entity is either a well-formed XML document
or it is not, no fuss, even no doubt (believe it or not: at least at
this basic level all major parsers out there agree, even in bizarre
cases), no Robustness Principle.

Something with a single (not escaped) ampersand in it isn't an XML
document. Period.

Even worse for you: I don't know of any parser that would let that pass.

Raise the problem with your input data, if only so that you have done it.

If 'they' force you to handle the problem, I'm afraid there is no
other way than to modify your input data, with a preprocessing step on
a copy or, if the sizes are small, in memory, if you want to use an
XML parser.

I'm sorry, I don't have better news.
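
The all-or-nothing behaviour described above is easy to observe from Python. The following is a small sketch (Python 3; `well_formedness_error` is a hypothetical helper name, not part of Expat) that returns Expat's error message and line number for a document, or `None` when the document is well-formed. It relies only on `ParserCreate`, `Parse`, `ExpatError` and `ErrorString` from `xml.parsers.expat`.

```python
import xml.parsers.expat


def well_formedness_error(document):
    """Return (message, line) for the first fatal error, or None if well-formed."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError as exc:
        # exc.code is Expat's numeric error code; ErrorString turns it into text.
        return xml.parsers.expat.ErrorString(exc.code), exc.lineno
    return None


# A bare ampersand is fatal; the escaped form is accepted.
print(well_formedness_error("<name>AT&T</name>"))
print(well_formedness_error("<name>AT&amp;T</name>"))
```

There is no in-between result: the parser either finishes without an exception or stops at the first offending token.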
rolf

From regis.st-gelais at laubrass.com Wed Feb 4 21:57:51 2009
From: regis.st-gelais at laubrass.com (Régis St-Gelais (Laubrass))
Date: Wed, 4 Feb 2009 15:57:51 -0500
Subject: [Expat-discuss] & symbol workaround
References: <89f89940902041156h179005a0m14e1059f407de0b4@mail.gmail.com>
 <89f89940902041240y5964ec26x28fa12c6011234f9@mail.gmail.com>
Message-ID: <3F658126F53541B89DAF078C35C1DDBA@laubrasssag4>

Maybe you could preprocess the xml file as you read it before passing the
buffer to the XML parser

Regis St-Gelais
www.laubrass.com

_______________________________________________
Expat-discuss mailing list
Expat-discuss at libexpat.org
http://mail.libexpat.org/mailman/listinfo/expat-discuss

From bradcausey at gmail.com Wed Feb 4 22:50:39 2009
From: bradcausey at gmail.com (Brad Causey)
Date: Wed, 04 Feb 2009 15:50:39 -0600
Subject: [Expat-discuss] & symbol workaround
In-Reply-To: <498A056B.808@pointsman.de>
References: <89f89940902041156h179005a0m14e1059f407de0b4@mail.gmail.com>
 <89f89940902041240y5964ec26x28fa12c6011234f9@mail.gmail.com>
 <498A056B.808@pointsman.de>
Message-ID: <498A0DAF.6040703@gmail.com>

All,

It sounds like the consensus is that I need to mod the incoming badly
formatted xml.
This is my solution, and it worked for what I needed it for:

fileo = open(i,'r')
file = open('buffer.xml','w')
unfixml = fileo.read()
fixml = string.replace(unfixml,'&',' ')
file.write(fixml)
file.flush()
file.close()
file = open('buffer.xml','r')

Hopefully this helps some other poor lad who has crappy XML.

Thanks to all for the input!

-Brad

From regis.st-gelais at laubrass.com Wed Feb 4 22:30:24 2009
From: regis.st-gelais at laubrass.com (Régis St-Gelais (Laubrass))
Date: Wed, 4 Feb 2009 16:30:24 -0500
Subject: [Expat-discuss] Test
Message-ID: <544B2A0D9F184F50AA98B8D7C9C445EF@laubrasssag4>

My last post did not show.
This is a test.
Please disregard.

From weigelt at metux.de Mon Feb 23 20:50:24 2009
From: weigelt at metux.de (Enrico Weigelt)
Date: Mon, 23 Feb 2009 20:50:24 +0100
Subject: [Expat-discuss] Test
In-Reply-To: <544B2A0D9F184F50AA98B8D7C9C445EF@laubrasssag4>
References: <544B2A0D9F184F50AA98B8D7C9C445EF@laubrasssag4>
Message-ID: <20090223195024.GC2905@nibiru.local>

* Régis St-Gelais (Laubrass) wrote:
> My last post did not show.
> This is a test.

ACK.

--
---------------------------------------------------------------------
 Enrico Weigelt == metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
    http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for a lot dozens of packages in dozens of versions:
    http://patches.metux.de/
---------------------------------------------------------------------

From weigelt at metux.de Mon Feb 23 20:55:41 2009
From: weigelt at metux.de (Enrico Weigelt)
Date: Mon, 23 Feb 2009 20:55:41 +0100
Subject: [Expat-discuss] & symbol workaround
In-Reply-To: <498A0DAF.6040703@gmail.com>
References: <89f89940902041156h179005a0m14e1059f407de0b4@mail.gmail.com>
 <89f89940902041240y5964ec26x28fa12c6011234f9@mail.gmail.com>
 <498A056B.808@pointsman.de> <498A0DAF.6040703@gmail.com>
Message-ID: <20090223195541.GD2905@nibiru.local>

* Brad Causey wrote:
> All,
>
> It sounds like the consensus is that I need to mod the incoming badly
> formatted xml.
> This is my solution, and it worked for what I needed it for:
>
> fileo = open(i,'r')
> file = open('buffer.xml','w')
> unfixml = fileo.read()
> fixml = string.replace(unfixml,'&',' ')
                                 ^^^^^^^

This will make trouble if you get some escaped symbol (eg. &amp;).
So, you'll have to find the &'s, check what comes after and then
decide whether to fixup or let it pass.

BTW: is there any way for hooking into the parser (some callback)
to catch those errors and then continue parsing ?
That would allow building an auto-fixing parser, especially for
cases like Brad's.

cu

From bradcausey at gmail.com Tue Feb 24 01:33:21 2009
From: bradcausey at gmail.com (Brad Causey)
Date: Mon, 23 Feb 2009 18:33:21 -0600
Subject: [Expat-discuss] & symbol workaround
In-Reply-To: <20090223195541.GD2905@nibiru.local>
References: <89f89940902041156h179005a0m14e1059f407de0b4@mail.gmail.com>
 <89f89940902041240y5964ec26x28fa12c6011234f9@mail.gmail.com>
 <498A056B.808@pointsman.de> <498A0DAF.6040703@gmail.com>
 <20090223195541.GD2905@nibiru.local>
Message-ID: <89f89940902231633m32e37098o7be93ecfa4022385@mail.gmail.com>

cu,

> This will make trouble if you get some escaped symbol (eg. &amp;).
> So, you'll have to find the &'s, check what comes after and then
> decide whether to fixup or let it pass.

Agreed. I further tuned the code later, but mainly wanted to give the list
an idea of my work-around. Also, I could be particularly lax in my
find/replace because all fields in my case that could contain ampersands
were getting thrown out of the report anyway. Lucky me! ;-p

> BTW: is there any way for hooking into the parser (some callback)
> to catch those errors and then continue parsing ?
> That would allow building an auto-fixing parser, especially for
> cases like Brad's.

Although python allows you to 'modify' the instance of the object, and any
part of it, I think it's just easier to make a one time 'workaround' UDF and
move on. I guess it depends on how heavily you depend on performance and
other variables.

Building from your idea.... I think the community could benefit greatly from
a parser that is less strict than the ones out there today. Although XML
does have strict rules, many companies/programs/tools adopt unusual
implementations of it. I know, I know, everyone is going to say 'well they
shouldn't do that' and 'then it's not really XML' but they do, and it is
closer to XML than any other text format.

Thoughts?

-Brad

From weigelt at metux.de Tue Feb 24 02:11:00 2009
From: weigelt at metux.de (Enrico Weigelt)
Date: Tue, 24 Feb 2009 02:11:00 +0100
Subject: [Expat-discuss] git mirror
Message-ID: <20090224011059.GE2905@nibiru.local>

Hi folks,

I've set up an automatic mirror of expat cvs to git:

    git://git.metux.de/expat/expat-cvsmirror.git

My current works will be available at:

    git://git.metux.de/expat/expat.git

cu

From weigelt at metux.de Tue Feb 24 03:22:03 2009
From: weigelt at metux.de (Enrico Weigelt)
Date: Tue, 24 Feb 2009 03:22:03 +0100
Subject: [Expat-discuss] HAVE_UNISTD_H needed ?
Message-ID: <20090224022203.GF2905@nibiru.local>

Hi folks,

I just wonder whether the configure check for unistd.h / the
HAVE_UNISTD_H symbol is really necessary. Is there any system
(supported by expat) which does NOT have unistd.h ?

cu

From Mark.Williams at techop.co.uk Tue Feb 24 10:02:27 2009
From: Mark.Williams at techop.co.uk (Mark Williams)
Date: Tue, 24 Feb 2009 09:02:27 +0000
Subject: [Expat-discuss] & symbol workaround
In-Reply-To: <20090223195541.GD2905@nibiru.local>
References: <89f89940902041156h179005a0m14e1059f407de0b4@mail.gmail.com>
 <89f89940902041240y5964ec26x28fa12c6011234f9@mail.gmail.com>
 <498A056B.808@pointsman.de> <498A0DAF.6040703@gmail.com>
 <20090223195541.GD2905@nibiru.local>
Message-ID:

> * Brad Causey wrote:
> > All,
> >
> > It sounds like the consensus is that I need to mod the incoming badly
> > formatted xml. This is my solution, and it worked for what I needed
> > it for:
> >
> > fileo = open(i,'r')
> > file = open('buffer.xml','w')
> > unfixml = fileo.read()
> > fixml = string.replace(unfixml,'&',' ')
>                                    ^^^^^^^
>
> This will make trouble if you get some escaped symbol (eg. &amp;).
> So, you'll have to find the &'s, check what comes after and then
> decide whether to fixup or let it pass.
>
> BTW: is there any way for hooking into the parser (some callback)
> to catch those errors and then continue parsing ?
> That would allow building an auto-fixing parser, especially for
> cases like Brad's.

It's not clear if you have any control over the "XML" that is the input to
your program. If so, get it changed to be valid XML. On awkward data you
can encode it to make it valid (I encode binary data as hexadecimal
strings).

Otherwise you need to preprocess the data and convert it into valid XML as
others have said.

I like Enrico's idea of having error callbacks. This could be useful in
many situations.

Mark.
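
The hexadecimal encoding of awkward data mentioned above can be sketched as follows. This is an illustration only (Python 3); the helper names `wrap_payload` and `unwrap_payload` and the `blob` element are assumptions made for the example, not anything described in the thread. Because hexadecimal digits never include `&` or `<`, the encoded payload cannot break well-formedness.

```python
import xml.etree.ElementTree as ET


def wrap_payload(tag, payload):
    """Build a one-element XML document carrying bytes as hex digits."""
    element = ET.Element(tag)
    element.text = payload.hex()
    return ET.tostring(element, encoding="unicode")


def unwrap_payload(document):
    """Parse the document and turn the hex digits back into bytes."""
    text = ET.fromstring(document).text or ""  # empty element has text None
    return bytes.fromhex(text)


raw = b"AT&T <raw> \x00\xff"
document = wrap_payload("blob", raw)
print(document)
print(unwrap_payload(document) == raw)
```

The cost is that the payload doubles in size and is no longer readable in the file, so this only helps when the producer of the data can be changed.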