[Expat-discuss] Special Characters part of the word

Nick MacDonald nickmacd at gmail.com
Wed Mar 31 23:43:43 CEST 2010


Junaid:

eXpat is behaving exactly as it should here: the files you are trying
to parse are not well-formed XML.  A number of characters are reserved
in XML, most notably '<' and '&', which you are no doubt using
unescaped.  In character data they need to be escaped, as &lt; and
&amp; respectively.  Please check the XML spec at w3.org for more
details.
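
To make the escaping point concrete, here is a minimal sketch.  It
assumes Python's standard-library binding to eXpat
(xml.parsers.expat) rather than the C API, which the original
question does not specify; the <eq> element name and the parse_text
helper are made up for illustration.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape


def parse_text(xml_doc):
    """Parse xml_doc and return all of its character data as one string."""
    pieces = []
    parser = xml.parsers.expat.ParserCreate()
    # Each call may deliver only part of the text, so collect the pieces.
    parser.CharacterDataHandler = pieces.append
    parser.Parse(xml_doc, True)
    return "".join(pieces)


equation = "A<B & D>E"

# Unescaped, '<B' looks like the start of a tag, so the parser rejects it.
try:
    parse_text("<eq>" + equation + "</eq>")
    raw_ok = True
except xml.parsers.expat.ExpatError:
    raw_ok = False

# escape() rewrites '&', '<' and '>' as &amp;, &lt; and &gt;.
escaped = escape(equation)
roundtrip = parse_text("<eq>" + escaped + "</eq>")
```

Once the equation is escaped in the file, the parser hands back the
original characters, so the matching code never has to deal with the
entity references itself.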

Additionally, eXpat does NOT guarantee how much data will be passed in
each call to its character data callback.  A single run of text may
arrive in several pieces, so if the string is being broken up you will
need to write code that appends the pieces to your own buffer as they
arrive.  There are completely logical reasons why this is necessary,
such as when a body of text is broken up with tags in the middle:

<tag1>
some body text
 <tag2/>
and yet more body text
</tag1>

As you can see, if you expected to have in a buffer the contents "some
body text and yet more body text" then you would need to do the
concatenation yourself, or else use a DOM parser rather than a
stream-oriented (SAX-style) parser like eXpat.
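
The concatenation described above might be sketched as follows.  This
again assumes Python's xml.parsers.expat binding; collect_text and
its whitespace normalisation are illustrative choices, not part of
eXpat itself.

```python
import xml.parsers.expat


def collect_text(xml_doc, target):
    """Return the joined text of each <target> element, nested tags included."""
    pieces = []    # text fragments seen so far inside the current target
    results = []   # one finished string per target element
    depth = 0      # how many elements deep we are inside the target

    def start(name, attrs):
        nonlocal depth
        if depth or name == target:
            depth += 1

    def end(name):
        nonlocal depth
        if depth:
            depth -= 1
            if depth == 0:
                # Join the fragments, then collapse runs of whitespace.
                results.append(" ".join("".join(pieces).split()))
                pieces.clear()

    def chars(data):
        # May be called several times for what looks like one string.
        if depth:
            pieces.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_doc, True)
    return results


doc = "<tag1>\nsome body text\n <tag2/>\nand yet more body text\n</tag1>"
texts = collect_text(doc, "tag1")
```

The buffer is only joined when the closing tag arrives, so it does
not matter how many calls the text was split across.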

Also, please note that a common error in the past has been to assume
that eXpat returns buffers that are zero terminated... it does NOT do
this... it passes a pointer and a length, and you must not read
anything beyond the specified length.

Good luck with your project,
  Nick


On Wed, Mar 31, 2010 at 2:29 AM, Junaid Khokhar <mjkhokhar at gmail.com> wrote:
> I am trying to parse xml file with expat. I have different equations and
> keywords containing special characters ( < , > , & ) e.g. A<B & D>E.  Expat
> returns a word whenever it finds a word delimiter. It also considers
> aforementioned special characters as word delimiters. I need to get the
> whole equation to match. Is there any way i can specify word delimiters in
> Expat ?

