[Expat-discuss] Summary of Pull API thoughts

Mon Mar 24 09:16:31 EST 2003

I've tried to collect the many good ideas provided by Karl Waclawek for the
design of a Pull API on top of Expat.

Any feedback (especially on omissions) is appreciated!  

Thanks,
Dave

-------------- next part --------------
0.0 This document attempts to capture design decisions and issues with
designing a Pull API for the Expat parser. 

1.0 Pull API Overview 

The Expat Pull API defines a new set of functions which allow a client
application to process nodes of an XML document one at a time by
"pulling" them in order via calls to a "Next()" function. The pull API
will be built "on top" of the existing API in the sense that its
implementation will include callbacks for startElementHandler,
endElementHandler, etc. These built in handlers will manage the state
necessary to return nodes to the caller of Next(). Next() will also ask
the client application for more raw data as necessary via a 
"GetNextBuffer" callback. 

A side effect of the pull API implementation is that it provides a
graceful means of interrupting XML parsing from inside a callback, even
when operating in "push" mode.

2.0 Pull API Specifics 

/*
 * XML_GetBufferHandler
 *
 * Callback prototype used by the Pull API to ask the user for more data.
 *
 * XML_SetGetBufferHandler
 *
 * Installs the XML_GetBufferHandler callback on a parser.
 */

typedef void (*XML_GetBufferHandler)(XML_Parser parser, char **bufferPtr,
                                     int *bufferLen, int *isFinal);

XMLPARSEAPI(void) XML_SetGetBufferHandler(XML_Parser p, XML_GetBufferHandler handler);
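
To make the callback contract concrete, here is a rough sketch of what a
client-side handler might look like when the document lives in memory and is
handed out in small chunks. XML_Parser is replaced by a void pointer so the
sketch compiles without expat.h, and doc_text, CHUNK_SIZE and
memory_get_buffer are made-up names for illustration only. The handler
signature is the one proposed above; nothing here exists in Expat today.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Expat's opaque parser handle (assumption for this sketch). */
typedef void *XML_Parser;

/* Handler signature as proposed in this document. */
typedef void (*XML_GetBufferHandler)(XML_Parser parser, char **bufferPtr,
                                     int *bufferLen, int *isFinal);

/* Hypothetical in-memory source, handed out in fixed-size chunks. */
enum { CHUNK_SIZE = 8 };
static char doc_text[] = "<a><b>hi</b></a>";
static int  doc_len = (int)(sizeof doc_text) - 1;
static int  doc_pos = 0;

/* Hands out the next chunk of doc_text and flags the last one. */
void memory_get_buffer(XML_Parser parser, char **bufferPtr,
                       int *bufferLen, int *isFinal)
{
    int remaining = doc_len - doc_pos;
    int n = remaining < CHUNK_SIZE ? remaining : CHUNK_SIZE;

    (void)parser;                     /* unused in this sketch */
    assert(remaining >= 0);
    *bufferPtr = doc_text + doc_pos;  /* client-owned memory, not XML_GetBuffer */
    *bufferLen = n;
    doc_pos += n;
    *isFinal = (doc_pos == doc_len);
}
```

Because the handler points into its own storage, it falls under the "user
buffer" case discussed in section 5.0.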

New functions: 
/* XML_PullParserCreate 
 * 
 * Creates a parser based on encoding, installs internal callbacks for 
 * implementing Pull based API
 */ 
XML_Parser XML_PullParserCreate(const XML_Char *encoding); 

/* XML_Next 
 * 
 * Does enough parsing to process the next node of the document. 
 * Pulls in new data if current buffer is exhausted. 
 * 
 */ 
enum XML_Status XML_Next(XML_Parser p); 

Higher-level XML_Next() functions are possible which allow the parser to
move ahead to specific nodes in the document.  For example, move to the
next element, or move to the next element named "Foo".

enum XML_Status XML_Next_______(XML_Parser p,...);

/* XML_GetNodeInfo 
 * 
 * Returns a pointer to an XML_NodeInfo structure containing name, character data, 
 * etc. information about the current node 
 * 
 */ 

XML_NodeInfo *XML_GetNodeInfo(XML_Parser p); 

typedef struct XML_NodeInfo
{
  XML_NodeType type;
  char *name;
  char *data;
  int data_len;
  /* More Here... */
} XML_NodeInfo;

3.0 Built-in callbacks and internal callbacks 

Callbacks will return "handling instructions" to the main processing
loop. The handling instructions are one of: 

XML_SKIP - Do not store node information in XML_NodeInfo, but continue
parsing. This is the default, and is compatible with existing parsing
code. 

XML_USE - Store node information in XML_NodeInfo, and return to the
caller of the parsing function. 

XML_STOP - Do not store node information in XML_NodeInfo, and return to
the caller of the parsing function.  Parsing is resumable. 

XML_ERR - Do not store node information in XML_NodeInfo, and return a
status of "XML_STATUS_ERROR" to the caller.  XML_GetErrorCode will
return "XML_USER_ERROR".  Parsing is resumable.
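
As a rough illustration of how a main loop could act on these four codes,
here is a toy sketch. Only the names XML_SKIP, XML_USE, XML_STOP and XML_ERR
come from this document. The enum tag, the toy_* types, and the idea of
feeding the loop a pre-tokenized array of node names are assumptions made so
the example stands alone; the real loop would live inside Expat's processors.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Handling instructions from this section; numeric values are arbitrary. */
enum XML_Handling { XML_SKIP, XML_USE, XML_STOP, XML_ERR };

/* Toy stand-ins for Expat's status codes and callback type. */
enum toy_status { TOY_STATUS_ERROR, TOY_STATUS_OK };
typedef enum XML_Handling (*toy_handler)(const char *name);

struct toy_parser {
    const char **events;   /* pre-tokenized node names, for illustration */
    size_t       count;
    size_t       pos;
    toy_handler  handler;
    const char  *current;  /* plays the role of XML_NodeInfo */
};

/* Runs the "main processing loop" until a handler asks it to return. */
enum toy_status toy_next(struct toy_parser *p)
{
    assert(p != NULL && p->handler != NULL);
    p->current = NULL;
    while (p->pos < p->count) {
        const char *name = p->events[p->pos++];
        switch (p->handler(name)) {
        case XML_SKIP:
            continue;                 /* store nothing, keep parsing */
        case XML_USE:
            p->current = name;        /* store node info, return to caller */
            return TOY_STATUS_OK;
        case XML_STOP:
            return TOY_STATUS_OK;     /* store nothing, return, resumable */
        case XML_ERR:
            return TOY_STATUS_ERROR;  /* report a user error, resumable */
        }
    }
    return TOY_STATUS_OK;             /* input exhausted */
}

/* Example handler: report only nodes named "item". */
enum XML_Handling only_items(const char *name)
{
    return strcmp(name, "item") == 0 ? XML_USE : XML_SKIP;
}
```

Since pos is kept in the parser object, a later call simply picks up where
the previous one stopped, which is what makes XML_STOP and XML_ERR resumable
in this toy model.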

Internal callbacks will be automatically installed for some (most? all?)
of expat's callbacks. StartElementHandler and EndElementHandler will
certainly be among the set of internal callbacks. These callbacks will
capture node information and will return XML_SKIP or XML_USE depending
on the current call to the XML_Next() family of functions.  For simple
pull parsing, the user will not have to provide any callbacks except the
'GetNextBuffer' callback.

User callbacks can still be installed in conjunction with the internal
callbacks above for advanced processing.  The internal callbacks will
perform their own processing as appropriate, and then forward the call
to the user callback.  The user callback could then apply additional
processing.  For instance, the user callback could convert an "XML_USE"
state to "XML_SKIP" for a node not of interest to an application.
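
One possible shape for this forwarding is sketched below. How the internal
decision reaches the user callback is still an open question (see 4.1); here
it is simply passed as an extra argument, which is an assumption of this
sketch. The names pull_state, user_start_handler, internal_start and
skip_comments are all hypothetical.

```c
#include <assert.h>
#include <string.h>

enum XML_Handling { XML_SKIP, XML_USE, XML_STOP, XML_ERR };

/* Hypothetical user callback: sees the internal decision and may change it. */
typedef enum XML_Handling (*user_start_handler)(void *userData,
                                                const char *name,
                                                enum XML_Handling proposed);

struct pull_state {
    user_start_handler user_start;  /* NULL when the client installed nothing */
    void              *user_data;
    const char        *node_name;   /* captured node information */
};

/* Internal start-element callback: capture first, then forward. */
enum XML_Handling internal_start(struct pull_state *st, const char *name)
{
    enum XML_Handling result = XML_USE;  /* plain XML_Next() reports every node */

    assert(st != NULL);
    st->node_name = name;
    if (st->user_start != NULL)
        result = st->user_start(st->user_data, name, result);
    return result;
}

/* Example user callback: downgrade <comment> elements to XML_SKIP. */
enum XML_Handling skip_comments(void *userData, const char *name,
                                enum XML_Handling proposed)
{
    (void)userData;
    return strcmp(name, "comment") == 0 ? XML_SKIP : proposed;
}
```

With no user callback installed, the internal decision passes through
unchanged, so simple pull parsing needs no extra setup.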

4.0 Existing API Issues/Changes 

Some other future changes for expat may impact the design &
implementation of pull parsing.

4.1 Namespace reporting 

Expat may return names as separate localName, prefix and uri parameters.
If callback signatures are changing already to pass namespace
information, we can have more options in extending callbacks to return
our node handling codes. 

4.2 Entity expansion

Because attribute values are reported as one chunk of data, entities
within the value are silently expanded.
The same applies to parameter entities in entity values.  If entities
are expanded differently, we may have to add new node types to allow a
pull based parser to move through attribute values.

4.3 Additional internal changes

The XML_NodeInfo structure will probably be added as an internal data
structure that is filled as each node is reached.

Additional callback function pointers are needed for forwarding from
internal pull callbacks to client callbacks.

5.0 Buffer Management 

When the parser calls the user's XML_GetBufferHandler, it expects the
user to return a pointer to buffer space containing new data.  

The user may accomplish this by allocating a new buffer via
XML_GetBuffer and then copying/reading data into that buffer, or by
referring to one of its own buffers.  If it refers to its own buffer, a
client application must ensure that this buffer remains valid until the
next call to XML_GetBufferHandler or until parsing is complete.

The parser must be able to determine when the user provided buffer was
created via XML_GetBuffer.  If the buffer was -not- created by
XML_GetBuffer, then the parser will be responsible for preserving any
partial tokens from the last buffer and prepending them to a user buffer
when parsing continues.
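
A minimal sketch of the "preserve and prepend" step for user-owned buffers
follows. It assumes the parser knows how many bytes of the last buffer it
consumed. The carry struct and both function names are invented for this
example, and a real implementation would more likely reuse Expat's internal
buffer than allocate a fresh block each time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the bytes of an unfinished token. */
struct carry {
    char  *data;
    size_t len;
};

/* Saves the unconsumed tail [consumed, len) of a user buffer. 0 on success. */
int carry_save(struct carry *c, const char *buf, size_t len, size_t consumed)
{
    size_t tail;
    char *copy = NULL;

    assert(consumed <= len);
    tail = len - consumed;
    if (tail > 0) {
        copy = malloc(tail);
        if (copy == NULL)
            return -1;
        memcpy(copy, buf + consumed, tail);
    }
    free(c->data);
    c->data = copy;
    c->len = tail;
    return 0;
}

/* Builds carry + next buffer in fresh memory; the caller frees the result. */
char *carry_prepend(struct carry *c, const char *next, size_t nextLen,
                    size_t *outLen)
{
    char *joined = malloc(c->len + nextLen + 1);

    if (joined == NULL)
        return NULL;
    if (c->len > 0)
        memcpy(joined, c->data, c->len);
    memcpy(joined + c->len, next, nextLen);
    *outLen = c->len + nextLen;
    joined[*outLen] = '\0';   /* terminator only for convenience in the demo */
    free(c->data);
    c->data = NULL;
    c->len = 0;
    return joined;
}
```

The copy in carry_save is what lets the client reuse or free its own buffer
as soon as the next XML_GetBufferHandler call arrives.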

The main processing loops request more data by returning a new error
code, XML_ERROR_BUFFER, to their caller (i.e. XML_Next()).

"Push mode" buffer management could be changed to work more like the pull
model (using the XML_GetBufferHandler() callback instead of the 

6.0 Internal Complications 

When XML_USE is returned and parsing is interrupted, some cleanup may be
deferred until the next call to XML_Next().

 It looks as if in most cases the cleanup after a callback is quite simple.
  E.g. for startElementHandler (non-empty element) it is
    poolClear(&tempPool);
  for the endElementHandler it is a while loop dealing with namespace bindings.
  All we need to do is to have a switch statement at the entry of XML_Next()
  that performs the cleanup depending on how we returned on the last call
  (assuming the cleanup was skipped when XML_USE was returned).
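
A sketch of what such an entry switch could look like, under the assumption
that the parser records which callback caused the last return. The
pending_cleanup enum and the two counters are made-up stand-ins for
tempPool and the namespace binding list; only the poolClear(&tempPool)
step and the binding loop are taken from the notes above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of how the previous XML_Next() call returned. */
enum pending_cleanup { CLEAN_NONE, CLEAN_START_ELEMENT, CLEAN_END_ELEMENT };

struct cleanup_parser {
    enum pending_cleanup pending;
    int temp_pool_used;   /* stands in for tempPool */
    int open_bindings;    /* stands in for the namespace binding list */
};

/* Runs the cleanup that was skipped when the last call returned XML_USE. */
void run_deferred_cleanup(struct cleanup_parser *p)
{
    assert(p != NULL);
    switch (p->pending) {
    case CLEAN_START_ELEMENT:
        p->temp_pool_used = 0;          /* poolClear(&tempPool) in Expat */
        break;
    case CLEAN_END_ELEMENT:
        while (p->open_bindings > 0)    /* release the closed tag's bindings */
            p->open_bindings--;
        break;
    case CLEAN_NONE:
        break;
    }
    p->pending = CLEAN_NONE;            /* never run the same cleanup twice */
}
```

XML_Next() would call this first and only then continue parsing.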

For an empty element, XML_Next() is tricky, since Expat assumes that both
callbacks (start and end) happen without interruption by an end of buffer.

Options:

- We add an EMPTY_TAG node type.  Internal callbacks for
StartElementHandler and EndElementHandler would somehow detect the empty
tag and set up the XML_NodeInfo appropriately.

- We set up an emptyTagProcessor, so that the next call to
  XML_Next() will call this as the processor, which will
  return with the empty element's tag name. That's a common way
  to handle parser state in Expat.
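
To illustrate the second option, here is a toy model of swapping the
processor function pointer. It is only a sketch: the struct, the node types
and the hard-coded "br" element are assumptions, and Expat's real processors
take the input buffer as arguments rather than working on canned data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum node_type { NODE_NONE, NODE_START, NODE_END };

struct empty_parser;
typedef void (*processor_fn)(struct empty_parser *p);

struct empty_parser {
    processor_fn    processor;     /* which routine the next call will run */
    enum node_type  type;          /* reported node, like XML_NodeInfo.type */
    const char     *name;
    const char     *pending_name;  /* tag name saved for the end half */
};

void content_processor(struct empty_parser *p);

/* Reports the end half of an empty element, then restores normal parsing. */
void empty_tag_processor(struct empty_parser *p)
{
    p->type = NODE_END;
    p->name = p->pending_name;
    p->pending_name = NULL;
    p->processor = content_processor;
}

/* Toy content processor: pretends the input is always the empty tag <br/>. */
void content_processor(struct empty_parser *p)
{
    p->type = NODE_START;
    p->name = "br";
    p->pending_name = p->name;
    p->processor = empty_tag_processor;  /* end half comes on the next call */
}

/* Toy XML_Next(): just runs whatever processor is currently installed. */
void toy_empty_next(struct empty_parser *p)
{
    assert(p != NULL && p->processor != NULL);
    p->processor(p);
}
```

The caller sees a start node followed by an end node with the same name on
two consecutive calls, with no buffer request in between.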