[Expat-discuss] expat parsing destructively?

Mohun Biswas m_biswas at mailinator.com
Mon Oct 29 16:57:20 CET 2007


Nick MacDonald wrote:
> Mark:
> 
> You are of course absolutely correct in that regard... and I don't
> know what kind of internal memory management eXpat performs, so this
> may well be a very bad idea.  I do know that in many cases, eXpat
> seems to return to you a copy of the same data it is parsing... I just
> don't know if it does that in all cases.  You could of course detect
> when the returned data did not reside within your supplied buffer,
> however it's not quite clear to me what action you might take in such a
> case.

I wrote up the following little test case just to get an idea of where
the data returned by expat resides. It prints the address range of the
parsed document, the address of each parsed attribute name and value,
and rough guesses at where the stack and heap are. The attribute
addresses don't seem to be near either the heap or the stack, nor
within the document itself. Damned if I know where they do live or
how they get there. Yes, I should UTSL.

This also shows that the document survives the parse undamaged, so 
clearly null bytes are not being written to it.

I still say it would be really great if expat had an optional 
destructive-parsing mode. For one thing it seems in keeping with the 
spirit of SAX where you chew through the data once linearly and never 
look back (except with your own state variables of course).

To expand a little bit on my original posting, I'm actually parsing a 
document contained in a file. The file is mapped into memory using 
mmap() and then passed to XML_Parse as a single buffer. Currently it's 
mapped read-only (PROT_READ) but I've been thinking how cool it would be 
if I could turn on writable mode (PROT_WRITE plus MAP_PRIVATE) and let 
expat munge the mapping in place. Memory would still need to be 
allocated, since the kernel would do a copy-on-write for each modified 
page, but I think it's a very safe bet that the kernel's memory 
management code is faster and more robust than mine would be.
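
For what it's worth, the writable private mapping itself is easy to set
up. Here is a minimal sketch, assuming POSIX mmap(); map_private() is a
name I made up for illustration, not anything in expat, and expat as it
stands would not write into the buffer. It only shows that stores into
such a mapping land in private copy-on-write pages and never reach the
file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a whole file copy-on-write.
 * Writes to the returned buffer modify private pages only; the file on
 * disk is untouched. Returns NULL on failure or for an empty file.
 */
static char *
map_private(const char *path, size_t *len)
{
     struct stat st;
     char *buf;
     int fd;

     assert(path != NULL && len != NULL);

     /* Read access is enough: MAP_PRIVATE never writes back. */
     fd = open(path, O_RDONLY);
     if (fd == -1)
         return NULL;

     if (fstat(fd, &st) == -1 || st.st_size <= 0) {
         close(fd);
         return NULL;
     }

     buf = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
         MAP_PRIVATE, fd, 0);
     close(fd);     /* the mapping stays valid after close */

     if (buf == MAP_FAILED)
         return NULL;

     *len = (size_t)st.st_size;
     return buf;
}
```

The caller would pass buf and len to XML_Parse as before and munmap()
the region when done. Only pages that actually get modified cost a
copy.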

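/*
 * Test case: print where expat's attribute strings live relative to
 * the document buffer, the stack, and the heap.
 */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include "expat.h"

static const char *document = "<doc>\n<elem a1='v1' a2='v2' />\n</doc>\n";

static XML_Parser xp;

static void
enter_elem(void *data, const char *el, const char **attrs)
{
     int i;
     void *heap;

     (void)data;     /* unused */

     fprintf(stderr, "ENTER %s\n", el);

     /* attrs is a NULL-terminated list of name/value pairs. */
     for (i = 0; attrs[i]; i += 2) {
         heap = malloc(32);     /* fresh allocation as a rough heap marker */
         fprintf(stderr, "%p=%p stack=%p heap=%p\n",
             (void *)attrs[i], (void *)attrs[i+1], (void *)&i, heap);
         free(heap);
     }
}

static void
leave_elem(void *data, const char *el)
{
     (void)data;     /* unused */

     fprintf(stderr, "LEAVE %s\n", el);
}

int
main(void)
{
     size_t len = strlen(document);

     /* Address range occupied by the document itself. */
     fprintf(stderr, "%p <--> %p\n",
         (void *)document, (void *)(document + len));

     xp = XML_ParserCreate(NULL);
     if (xp == NULL) {
         fprintf(stderr, "XML_ParserCreate failed\n");
         return 1;
     }

     XML_SetElementHandler(xp, enter_elem, leave_elem);

     if (XML_Parse(xp, document, (int)len, 1) == XML_STATUS_ERROR) {
         fprintf(stderr, "parse error: %s\n",
             XML_ErrorString(XML_GetErrorCode(xp)));
         XML_ParserFree(xp);
         return 1;
     }

     /* Print the document again to show the parse left it intact. */
     fprintf(stderr, "%s", document);

     XML_ParserFree(xp);
     return 0;
}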
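
Nick's suggestion of detecting whether returned data resides within the
supplied buffer can be done in code rather than by eyeballing the
printed addresses. A rough sketch follows; in_buffer() is a
hypothetical helper of my own naming, and it assumes a flat address
space where converting pointers to uintptr_t gives comparable integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if p points into [base, base + len),
 * else 0. Comparing as integers sidesteps the undefined behavior of
 * relational comparison between pointers into different objects.
 */
static int
in_buffer(const void *p, const void *base, size_t len)
{
     uintptr_t addr = (uintptr_t)p;
     uintptr_t lo = (uintptr_t)base;

     assert(base != NULL);

     /* Written this way, lo + len cannot overflow. */
     return addr >= lo && addr - lo < len;
}
```

In the start-element handler this would be called as
in_buffer(attrs[i], document, strlen(document)) for each name and
value.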