[Expat-discuss] Pull API Status?

Tue Mar 11 11:09:49 EST 2003

Karl,

Thanks for the reply.  Let me make some comments and questions inline to
make sure I understand you.

> > I was browsing the archives and came across a proposal to implement a
> > pull based API on top of expat's XML_Parse+callbacks API.  This is
> > something I could use for a current project at work, and I'd be willing
> > to tackle it if no one is actively working on this.
> 
> No one is currently working on it, as we plan to release Expat 2.0 first.

This is understandable.  I didn't read the roadmap closely enough, and was
going on the August posts referring to suspension perhaps being part of
1.96.xxx.

> - Expat is already Pull based internally. So a lot of code can be re-used.
>   One does not need to completely re-implement the layer on top of xmltok.
> - The main things to change are:
>   - instead of "pushing" buffers (with XML_ParseBuffer), have the main
>     parsing loop pull buffers with an XML_GetNextBuffer callback.

Just to get my terminology straight... By "main parsing loop" you mean any
of the prologProcessor, contentProcessor, externalEntityProcessor family of
functions, yes?

So, contentProcessor (or doContent) would change to call the
XML_GetNextBuffer callback whenever it needed more bytes.

Pull mode would work by calling XML_NextNode() (or whatever we name it) and
would -never- call XML_Parse or XML_ParseBuffer.

Push mode would continue to work in that XML_Parse and XML_ParseBuffer would
provide pointers and lengths to the main parser structure just like they do
today, and the absence of an XML_GetNextBuffer callback would ensure that
control returns to XML_Parse callers just like it does now.
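
A rough sketch of that split, not taken from the Expat sources: every name
below (ToyParser, toy_refill, toy_one_chunk) is hypothetical, and the
callback signature only loosely follows the XML_GetNextBuffer idea.  The
point is that one refill step can serve both modes, with a missing callback
meaning "return to the push-mode caller".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of this is Expat's real API. */
typedef void (*ToyGetNextBuffer)(void *userData, const char **buf, int *len);

typedef struct {
    const char *buf;                /* current input window */
    int len;                        /* bytes left in the window */
    ToyGetNextBuffer getNextBuffer; /* NULL means push mode */
    void *userData;
} ToyParser;

typedef enum { TOY_HAVE_BYTES, TOY_RETURN_TO_CALLER, TOY_END_OF_INPUT } ToyRefill;

/* Called by the parsing loop whenever it wants more bytes. */
static ToyRefill toy_refill(ToyParser *p)
{
    if (p->len > 0)
        return TOY_HAVE_BYTES;
    if (p->getNextBuffer == NULL)
        return TOY_RETURN_TO_CALLER;   /* push mode: caller feeds the next chunk */
    p->getNextBuffer(p->userData, &p->buf, &p->len);
    if (p->buf == NULL || p->len <= 0) {
        /* NULL (or, in this sketch, an empty buffer) signals end of input */
        p->buf = NULL;
        p->len = 0;
        return TOY_END_OF_INPUT;
    }
    return TOY_HAVE_BYTES;
}

/* Example source: hands out one string, then NULL on the next call. */
static void toy_one_chunk(void *userData, const char **buf, int *len)
{
    const char **pending = userData;
    *buf = *pending;
    *len = *pending ? (int)strlen(*pending) : 0;
    *pending = NULL;
}
```

With no callback installed the parsing loop simply unwinds, which is what
XML_Parse callers see today.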

>   - add return codes to all the callbacks (like XML_SKIP, XML_USE,
>     XML_ERROR, ...)
>   - supply internal callbacks which perform the PULL API specific
>     data preparation and also do any required filtering

Return codes seem to be the cleanest way to do this, especially for an Expat
2.1 or 3.0 release.
However, should we preserve compatibility with "push-era" callbacks better
by providing an XML_SetNodeHandling() function to be used in a callback where
the user would send the XML_SKIP, XML_USE flags appropriately?

The default could always be "XML_SKIP".  Callbacks can continue to have void
return type and for push mode users, the main loop would continue until the
buffer is exhausted, just like today.
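
To make the idea concrete, here is a minimal sketch under stated
assumptions: the types and functions (ToyParser, toy_set_node_handling,
toy_dispatch_start) are made up for illustration and are not Expat code.
The dispatcher resets the flag to the SKIP default before each void
callback, so a push-era handler that never touches the flag behaves exactly
as before.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for XML_SKIP / XML_USE / XML_ERROR. */
typedef enum { TOY_SKIP, TOY_USE, TOY_ERROR } ToyHandling;

typedef struct ToyParser ToyParser;
typedef void (*ToyStartHandler)(ToyParser *p, const char *name);

struct ToyParser {
    ToyHandling handling;         /* set by the callback, read by the loop */
    ToyStartHandler startHandler; /* void return type, as in push mode */
};

/* Plays the role of the proposed XML_SetNodeHandling(). */
static void toy_set_node_handling(ToyParser *p, ToyHandling h)
{
    p->handling = h;
}

/* Reset to the default, run the callback, report what it asked for. */
static ToyHandling toy_dispatch_start(ToyParser *p, const char *name)
{
    p->handling = TOY_SKIP;
    if (p->startHandler != NULL)
        p->startHandler(p, name);
    return p->handling;
}

/* Push-era handler: unaware of node handling, so the default applies. */
static void legacy_handler(ToyParser *p, const char *name)
{
    (void)p;
    (void)name;
}

/* Pull-aware handler: asks the loop to stop on <item> elements only. */
static void pull_handler(ToyParser *p, const char *name)
{
    if (strcmp(name, "item") == 0)
        toy_set_node_handling(p, TOY_USE);
}
```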

> - The Next() function would simply call the main parsing loop which
>   returns when an (internal) callback returns XML_USE. The data to be
>   reported would be stored in some fields in the Parser structure.

Agreed.

> In addition we would also want to improve the API with regards to
> complete entity reporting (currently the same restrictions as SAX2)
> and namespace reporting (it seems better to return names as separate
> localName, prefix and uri parameters).

My project does not have complex namespace reporting needs, but I would
certainly want any new pull api functions to provide as much namespace
support as the rest of Expat.

> And, of course, it should still be possible to use Expat in Push mode.

Absolutely.

> > After I hear back on the status, I have a couple of specific questions
> > for how people would like to handle character text nodes in a pull
> > based API.
> 
> I think if we want your API re-usable we should put a lot of thought
> into it.

I agree, especially with regards to namespace handling.  However, I would
like to scope out a level of effort based on the following first phase
requirements:

1.  Provide pull parsing that can handle element, attribute, and text nodes
for well formed XML documents.  Use a minimal subset of the Java API at
www.xmlpull.org as a guideline sample API.  A subset of this API is included
at the bottom of the email.

2.  The real assessment is this - how extensive are the changes to the main
parsing loop(s)?  They have to:
	A.  Handle pulling buffers from users when needed.  If this logic
	    can stay "outside" in XML_Next(), it will be helpful, I would
	    imagine.
	B.  Handle "XML_USE/XML_SKIP/XML_ERROR" values set during callbacks.
	C.  If XML_USE is returned, store information about the current
	    node for retrieval and return.

3.  Default callbacks that match the capabilities of the Pull API should be
provided.  That is, "XML_USE" should be the default for StartElement,
EndElement, and Character callbacks.
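
Items 2B, 2C and 3 can be sketched roughly as follows.  This is a toy
model, not Expat code: a fixed event array stands in for the tokenizer, all
names (ToyEvent, ToyParser, toy_next, toy_default_handler, toy_text_only)
are hypothetical, and error handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOY_START_TAG, TOY_END_TAG, TOY_TEXT, TOY_END_DOCUMENT } ToyNodeType;
typedef enum { TOY_SKIP, TOY_USE } ToyHandling;

typedef struct { ToyNodeType type; const char *data; } ToyEvent;

typedef struct ToyParser ToyParser;
typedef void (*ToyHandler)(ToyParser *p, const ToyEvent *ev);

struct ToyParser {
    const ToyEvent *events;  /* stand-in for what the tokenizer produces */
    size_t count, pos;
    ToyHandler handler;      /* NULL selects the default handler */
    ToyHandling handling;    /* item 2B: value set during the callback */
    ToyEvent current;        /* item 2C: node stored for later retrieval */
};

/* Item 3: the default handler marks every node as "use". */
static void toy_default_handler(ToyParser *p, const ToyEvent *ev)
{
    (void)ev;
    p->handling = TOY_USE;
}

/* Example user handler: only text nodes are reported to the caller. */
static void toy_text_only(ToyParser *p, const ToyEvent *ev)
{
    if (ev->type == TOY_TEXT)
        p->handling = TOY_USE;
}

/* Runs the loop until a handler flags a node, then returns its type. */
static ToyNodeType toy_next(ToyParser *p)
{
    while (p->pos < p->count) {
        const ToyEvent *ev = &p->events[p->pos++];
        ToyHandler h = p->handler ? p->handler : toy_default_handler;
        p->handling = TOY_SKIP;
        h(p, ev);
        if (p->handling == TOY_USE) {
            p->current = *ev;
            return ev->type;
        }
    }
    return TOY_END_DOCUMENT;
}
```

In this model the only state the loop gains is the handling flag and the
stored node, which is the kind of small footprint the proof of concept is
aiming for.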

Seeing the roadmap, I can see why the main expat distribution is holding off
on these things...  My goal would be to provide a proof of concept
implementation that minimizes impact to the main parsing loops so that any
changes to the main parsing loop that occur during the 1.95 -> 2.0
transition are easy to merge with changes necessary for pull parsing.

Any feedback would be appreciated!

Thanks,
Dave

Minimal API for proof of concept:

typedef enum 
{
	START_DOCUMENT,
	END_DOCUMENT,
	START_TAG,
	END_TAG,
	TEXT,
} NodeType;

typedef enum
{
	XML_USE,
	XML_SKIP,
	XML_ERROR,
} NodeHandling;

/* Main entry point for pull parsing.  Returns the NodeType of the most
   recently parsed node where XML_USE was set in a callback */
NodeType XML_Next(XML_Parser p);

/* Function used in callbacks to set pull parse handling of the node */
void XML_SetNodeHandling(XML_Parser p,NodeHandling handling);

/* Functions to retrieve information about the current node. */
char *XML_GetName();		/* Returns the name of the current node */
char *XML_GetText();		/* Returns the text content of the current node */

int XML_GetAttributeCount();  
char *XML_GetAttributeName(int index);
char *XML_GetAttributeValue(int index);

/* Clearly the above family of functions would grow/change to handle
namespaces, CDATA, comments, etc. */

/* Pull buffer management */
/* User provided function will provide a pointer to additional buffer space
   and the length of that buffer.  If the user sets *nextBufferPtr to NULL,
   this signals the end of input. */

typedef void (*XML_GetNextBufferHandler)(void *userData,
                                         char **nextBufferPtr,
                                         int *nextBufferLen);
void XML_SetGetNextBufferHandler(XML_Parser p, XML_GetNextBufferHandler handler);
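
For a feel of the client side, here is a small sketch of a loop written
against an API of this shape.  Nothing here calls Expat: DemoParser,
Demo_Next and Demo_GetValue are hypothetical stand-ins that replay a fixed
node list, loosely mirroring XML_Next() and XML_GetName()/XML_GetText()
above, with an explicit parser argument added as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { START_DOCUMENT, END_DOCUMENT, START_TAG, END_TAG, TEXT } NodeType;

/* Stand-in parser: replays a fixed node list instead of parsing XML. */
typedef struct { NodeType type; const char *value; } DemoNode;
typedef struct { const DemoNode *nodes; size_t count, pos; } DemoParser;

/* Advance to the next node and report its type. */
static NodeType Demo_Next(DemoParser *p)
{
    if (p->pos >= p->count)
        return END_DOCUMENT;
    return p->nodes[p->pos++].type;
}

/* Name (for tags) or text (for text nodes) of the node last returned. */
static const char *Demo_GetValue(const DemoParser *p)
{
    return p->pos > 0 ? p->nodes[p->pos - 1].value : NULL;
}

/* Client loop: gather text inside <title> elements, return their count. */
static size_t collect_titles(DemoParser *p, char *out, size_t outSize)
{
    int inTitle = 0;
    size_t titles = 0;
    NodeType t;

    assert(outSize > 0);
    out[0] = '\0';
    while ((t = Demo_Next(p)) != END_DOCUMENT) {
        const char *v = Demo_GetValue(p);
        if (t == START_TAG && strcmp(v, "title") == 0) {
            inTitle = 1;
            titles++;
        } else if (t == END_TAG && strcmp(v, "title") == 0) {
            inTitle = 0;
        } else if (t == TEXT && inTitle) {
            /* append, always leaving room for the terminator */
            strncat(out, v, outSize - strlen(out) - 1);
        }
    }
    return titles;
}
```

The caller owns the loop and its state (inTitle here), which is the main
attraction of pull parsing over keeping that state in userData across
callbacks.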