[Expat-discuss] Split UTF-8 sequence possible?

Jeff Garbers jgarbers at xltsoftware.com
Mon Nov 10 11:13:37 EST 2003


Having just overcome the newbie problem of not realizing that expat
feeds UTF-8 sequences to my handlers, I'm now wondering if
expat ever splits a multi-byte UTF-8 sequence across two calls to my
character handler callback.

For example, say there's a non-ASCII accented character
in its input character data (however it may have been encoded).
expat will want to send me a two-byte UTF-8 sequence.  If there's
only one byte left in the output buffer, will it (1) call my character
data callback with the buffer one short of capacity, and save the two-byte
sequence for the next callback, or (2) put the first of the two UTF-8
bytes in the buffer, call my callback, and then put the second at the
start of the buffer for the NEXT callback?

I'm really hoping it's #1. Can anybody confirm this?
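
If the answer turned out to be #2, the handler would have to hold back any
partial sequence at the end of each chunk and prepend it to the next one.
Here is a rough sketch in C of the check that would need. The name
incomplete_utf8_tail is a hypothetical helper, not part of expat, and the
sketch only looks at the lead byte and the continuation bytes that follow
it; it does not otherwise validate the UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper (not an expat function).
 * Returns how many bytes at the end of s[0..len) begin a UTF-8 sequence
 * that is not completed within the chunk, or 0 if the chunk ends on a
 * character boundary (or the tail is malformed and cannot be judged). */
static size_t incomplete_utf8_tail(const char *s, size_t len)
{
    size_t back = 0;
    size_t need;
    size_t have;
    unsigned char lead;

    /* Step back over at most three trailing continuation bytes (10xxxxxx). */
    while (back < len && back < 3 &&
           ((unsigned char)s[len - 1 - back] & 0xC0) == 0x80)
        back++;

    if (back == len)
        return 0;               /* empty chunk, or no lead byte in view */

    lead = (unsigned char)s[len - 1 - back];

    /* The lead byte says how long the whole sequence should be. */
    if ((lead & 0x80) == 0x00)
        need = 1;               /* 0xxxxxxx: ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;               /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0)
        need = 3;               /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0)
        need = 4;               /* 11110xxx */
    else
        return 0;               /* not a lead byte: malformed input */

    have = back + 1;            /* lead byte plus continuation bytes seen */
    return have < need ? have : 0;
}
```

A character data handler could call this on each chunk it receives, process
only the first len minus that many bytes, and stash the remainder to put in
front of the next chunk. If the answer is #1, the function would always
return 0 and none of this would be necessary.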

Thanks -- Jeff Garbers



