[Expat-discuss] XML_ParserReset

Karl Waclawek karl@waclawek.net
Fri Apr 19 20:20:02 2002


> Karl Waclawek writes:
>  > What about the XML_UNICODE patch then?
>  > Is really nobody interested in UTF-16 output?
> 
> I am, but that seems a slightly lower priority.  I'll try to look at
> it as well if I can manage enough time.
> 
>  > I supplied a patch a while ago, and it seems to work for me,
>  > but I have never really subjected it to any targeted Unicode
>  > testing, so that is what it would need.
> 
> The patch is currently in two files and a chunk of preprocessor magic
> in the comments.  Would it be possible to create a single patch
> against the current CVS? 

I am not sure I understand. Do you mean the two patches I submitted?
If yes, I can give you a diff of the combined patch (NSTriplet & XML_UNICODE)
against 1.95.2. 

But against the current CVS - does this mean I have to somehow
merge my patch in? Maybe the best approach is for me to take my
diff, manually re-apply it to the current CVS, and send you the
completed file(s)? (I believe only xmlparse.c and expat.h are affected.)

> That would make it a lot easier for me to
> review.  If you can summarize the specific tests you think are needed,
> that would really help as well; 

I guess the best approach would be to run a few XML files through
both builds of Expat (XML_UNICODE on and off), then run a Unicode
converter on each output and check whether the cross-converted
files match the ones Expat produced directly.
The problem is likely to be getting XML files with lots of characters
beyond the typical Western set.
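
The cross-conversion check described above could be sketched roughly as
follows. This is only an illustration, not part of any Expat test suite:
the function names are hypothetical, and it assumes the XML_UNICODE build's
output was dumped as raw UTF-16 code units in host byte order with no
byte order mark.

```python
import sys
from typing import Optional


def native_utf16_codec() -> str:
    """Pick the UTF-16 codec matching the host byte order (assumption:
    the XML_UNICODE build writes code units in native order, no BOM)."""
    return "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def outputs_match(utf8_output: bytes, utf16_output: bytes,
                  codec: Optional[str] = None) -> bool:
    """Cross-convert the UTF-8 build's output to UTF-16 and compare it
    byte for byte with what the XML_UNICODE build produced."""
    codec = codec or native_utf16_codec()
    expected = utf8_output.decode("utf-8").encode(codec)
    return expected == utf16_output
```

Test inputs should include characters outside Latin-1 and at least one
character above U+FFFF, since those are the cases where the two encodings
differ most.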

> I'd like to add tests for everything
> that gets changed in the library if I can, to ensure we're getting the
> results we think we are, and to avoid regressions as maintenance
> continues.  Any help in writing the tests would be appreciated as
> well.

Currently I am working on UTF-8 <--> UTF-16 converters for
an XML writer. They should be done sometime next week.
I can supply those, as a first step.
But they are not written in C, I am afraid.
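
To show what such converters have to handle, here is a minimal sketch in
Python. It is not the converter code mentioned above, and all names in it
are made up for illustration. It works on lists of 16-bit code units and
spells out the surrogate-pair arithmetic explicitly, since characters above
U+FFFF are where a converter is most likely to go wrong.

```python
from typing import List


def code_point_to_utf16(cp: int) -> List[int]:
    """Encode one Unicode code point as one or two UTF-16 code units."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value: %#x" % cp)
    if cp < 0x10000:
        return [cp]                      # fits in a single code unit
    cp -= 0x10000                        # 20 bits remain
    return [0xD800 | (cp >> 10),         # high surrogate: top 10 bits
            0xDC00 | (cp & 0x3FF)]       # low surrogate: bottom 10 bits


def utf8_to_utf16_units(data: bytes) -> List[int]:
    """Convert UTF-8 bytes to a list of UTF-16 code units."""
    units: List[int] = []
    for ch in data.decode("utf-8"):      # decoding rejects malformed UTF-8
        units.extend(code_point_to_utf16(ord(ch)))
    return units


def utf16_units_to_utf8(units: List[int]) -> bytes:
    """Convert a list of UTF-16 code units back to UTF-8 bytes."""
    chars = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:        # high surrogate needs a partner
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            cp = u
            i += 1
        chars.append(chr(cp))
    return "".join(chars).encode("utf-8")
```

A round trip through both functions should give back the original bytes
for any well-formed UTF-8 input, which makes a simple self-check.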
 
> (One advantage of getting this one fixed is that the Python bindings
> will be able to avoid the current UTF-16 -> UTF-8 -> UTF-16 dance that
> happens now when the user wants Python Unicode strings instead of
> UTF-8; that's a lot of useless transformation that could be saved!)

Also, I think I remember someone writing about a Java wrapper using JNI.
Java is natively UTF-16 too, I believe.

Karl