[Expat-discuss] expat parsing destructively?
Mohun Biswas
m_biswas at mailinator.com
Mon Oct 29 16:57:20 CET 2007
Nick MacDonald wrote:
> Mark:
>
> You are of course absolutely correct in that regard... and I don't
> know what kind of internal memory management eXpat performs, so this
> may well be a very bad idea. I do know, that in many cases, eXpat
> seems to return to you a copy of the same data it is parsing... I just
> don't know if it does that in all cases. You could of course detect
> when the returned data did not reside within your supplied buffer,
> however it's not quite clear to me what action you might take in such a
> case.

I wrote up the following little test case just to get an idea where the
data returned by expat resides. It prints the address range of the
parsed document, the address of each parsed attribute name and value,
and rough guesses at where the stack and heap seem to be residing.
The addresses of the attributes don't seem to be near the heap or the
stack, nor within the document itself. Damned if I know where they do
live or how they get there. Yes, I should UTSL.

This also shows that the document survives the parse undamaged, so
clearly null bytes are not being written to it.
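
For what it's worth, the check Nick mentions (does a returned pointer lie
inside the buffer handed to XML_Parse?) could be sketched roughly as
below. The helper name ptr_in_buffer is made up for illustration and is
not part of expat. It compares addresses as integers via uintptr_t,
because relational comparison of pointers into different objects is
undefined in C.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an expat API): returns nonzero if p points
 * at one of the len bytes starting at buf, zero otherwise.
 */
static int
ptr_in_buffer(const void *p, const void *buf, size_t len)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t base = (uintptr_t)buf;

    /* The subtraction is only evaluated when addr >= base. */
    return addr >= base && addr - base < len;
}
```

In the start-element handler one could then call something like
ptr_in_buffer(attrs[i], document, strlen(document)) on each attribute
pointer instead of eyeballing the printed addresses.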

I still say it would be really great if expat had an optional
destructive-parsing mode. For one thing it seems in keeping with the
spirit of SAX where you chew through the data once linearly and never
look back (except with your own state variables of course).

To expand a little bit on my original posting, I'm actually parsing a
document contained in a file. The file is mapped into memory using
mmap() and then passed to XML_Parse as a single buffer. Currently it's
mapped read-only (PROT_READ) but I've been thinking how cool it would be
if I could turn on writable mode (PROT_WRITE plus MAP_PRIVATE) and let
expat munge the mapping in place. Memory would still need to be
allocated, since the kernel would do a copy-on-write for each modified
page, but I think it's a very safe bet that the kernel's memory
management code is faster and more robust than mine would be.
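
The mapping side of that idea can be tried without expat at all. The
sketch below assumes a POSIX system; the function name
private_mapping_leaves_file_intact is hypothetical. It maps a file with
PROT_READ | PROT_WRITE and MAP_PRIVATE, writes a null byte into the
mapping the way a destructive parser might, and then re-reads the file
to confirm the change never reached the disk.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical demo, not part of expat.
 * Returns 1 if the file is unchanged after writing to a private
 * mapping of it, 0 if it changed, and -1 on any error.
 */
static int
private_mapping_leaves_file_intact(const char *path)
{
    struct stat st;
    char *map;
    char original, on_disk;
    int fd, result = -1;

    /* Read-only access to the file is enough for a MAP_PRIVATE mapping. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        map = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED) {
            original = map[0];
            map[0] = '\0';   /* kernel copies this page on write */

            /* The descriptor's offset is still 0, so this reads byte 0. */
            if (read(fd, &on_disk, 1) == 1)
                result = (on_disk == original);

            munmap(map, (size_t)st.st_size);
        }
    }
    close(fd);
    return result;
}
```

Only the pages actually written get copied, so a parser that touched a
few bytes per page would still end up duplicating most of the document
in memory.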
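
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "expat.h"

/* The document is a string literal, so it must not be written to. */
static const char *document = "<doc>\n<elem a1='v1' a2='v2' />\n</doc>\n";
static XML_Parser xp;

static void
enter_elem(void *data, const char *el, const char **attrs)
{
    int i;
    void *heap;

    (void)data;    /* unused */
    fprintf(stderr, "ENTER %s\n", el);
    for (i = 0; attrs[i]; i += 2) {
        /* A fresh allocation serves as a rough marker for the heap. */
        heap = malloc(32);
        /* name-address=value-address, then the stack and heap markers */
        fprintf(stderr, "%p=%p stack=%p heap=%p\n",
                (void *)attrs[i], (void *)attrs[i+1], (void *)&i, heap);
        free(heap);
    }
}

static void
leave_elem(void *data, const char *el)
{
    (void)data;    /* unused */
    fprintf(stderr, "LEAVE %s\n", el);
}

int
main(void)
{
    size_t len = strlen(document);

    /* Address range occupied by the document itself. */
    fprintf(stderr, "%p <--> %p\n",
            (void *)document, (void *)(document + len));
    xp = XML_ParserCreate(NULL);
    if (xp == NULL)
        return 1;
    XML_SetElementHandler(xp, enter_elem, leave_elem);
    if (XML_Parse(xp, document, (int)len, 1) == XML_STATUS_ERROR)
        fprintf(stderr, "parse error: %s\n",
                XML_ErrorString(XML_GetErrorCode(xp)));
    /* Print the document again to show the parse left it intact. */
    fprintf(stderr, "%s", document);
    XML_ParserFree(xp);
    return 0;
}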