[Expat-discuss] XML_ParserReset
Karl Waclawek
karl@waclawek.net
Fri Apr 19 20:20:02 2002
> Karl Waclawek writes:
> > What about the XML_UNICODE patch then?
> > Is really nobody interested in UTF-16 output?
>
> I am, but that seems a slightly lower priority. I'll try to look at
> it as well if I can manage enough time.
>
> > I supplied a patch a while ago, and it seems to work for me,
> > but I have never really subjected it to any targeted Unicode
> > testing, so that is what it would need.
>
> The patch is currently in two files and a chunk of preprocessor magic
> in the comments. Would it be possible to create a single patch
> against the current CVS?
I am not sure I understand. Do you mean the two patches I submitted?
If so, I can give you a diff of the combined patch (NSTriplet & XML_UNICODE)
against 1.95.2.
But a patch against the current CVS would mean I have to merge my
changes in somehow. Maybe the best approach is for me to look at my
diff, manually re-apply it to the current CVS, and send you the
completed file(s)? (I believe only xmlparse.c and expat.h are affected.)
> That would make it a lot easier for me to
> review. If you can summarize the specific tests you think are needed,
> that would really help as well;
I guess the best approach would be to run a few XML files through
both compiled versions of Expat (XML_UNICODE on and off),
then run a Unicode converter on each output and check
whether the cross-converted files match the ones produced by Expat.
The hard part will likely be finding XML files with many characters
beyond the typical Western set.
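
A minimal sketch of that cross-conversion check, written in Python rather
than C. It assumes the XML_UNICODE build's output has been captured as
little-endian UTF-16 without a byte order mark, and the other build's output
as UTF-8. The names make_sample_xml and outputs_match are hypothetical
helpers and not part of Expat.

```python
def make_sample_xml() -> str:
    """Build a small document with characters outside the Western set.

    The last character (U+1D11E) lies beyond the Basic Multilingual Plane,
    so it needs a surrogate pair in UTF-16 and four bytes in UTF-8.
    """
    text = "caf\u00e9 \u0416\u0443\u043a \u65e5\u672c\u8a9e \U0001D11E"
    return "<doc>" + text + "</doc>"


def outputs_match(utf8_output: bytes, utf16_output: bytes) -> bool:
    """Cross-convert the two captured outputs and compare both directions.

    Assumption: utf16_output is little-endian UTF-16 with no BOM.
    """
    utf16_from_utf8 = utf8_output.decode("utf-8").encode("utf-16-le")
    utf8_from_utf16 = utf16_output.decode("utf-16-le").encode("utf-8")
    return utf16_from_utf8 == utf16_output and utf8_from_utf16 == utf8_output
```

The sample generator is one way around the lack of ready-made test files:
the same string can be written out in either encoding and fed to both builds.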
> I'd like to add tests for everything
> that gets changed in the library if I can, to ensure we're getting the
> results we think we are, and to avoid regressions as maintenance
> continues. Any help in writing the tests would be appreciated as
> well.
I am currently working on UTF-8 <--> UTF-16 converters for
an XML writer. They should be done sometime next week.
I can supply those as a first step,
but they are not written in C, I am afraid.
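
To illustrate what such a converter pair involves, here is a rough Python
sketch. It is not the converter code mentioned above; the function names are
hypothetical, UTF-16 is represented as a list of 16-bit code units, and the
UTF-8 decoder deliberately skips some validation (it does not reject overlong
forms or encoded surrogates).

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of UTF-16 code units (integers)."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte gives the sequence length and the top payload bits.
        if lead < 0x80:
            cp, n = lead, 1
        elif lead >> 5 == 0b110:
            cp, n = lead & 0x1F, 2
        elif lead >> 4 == 0b1110:
            cp, n = lead & 0x0F, 3
        elif lead >> 3 == 0b11110:
            cp, n = lead & 0x07, 4
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % i)
        if i + n > len(data):
            raise ValueError("truncated UTF-8 sequence")
        for cont in data[i + 1:i + n]:
            if cont >> 6 != 0b10:
                raise ValueError("invalid UTF-8 continuation byte")
            cp = (cp << 6) | (cont & 0x3F)
        i += n
        if cp < 0x10000:
            units.append(cp)
        else:
            # Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units


def utf16_units_to_utf8(units) -> bytes:
    """Encode a sequence of UTF-16 code units as UTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        i += 1
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i >= len(units) or not 0xDC00 <= units[i] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            cp = u
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes((0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)))
        elif cp < 0x10000:
            out += bytes((0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)))
        else:
            out += bytes((0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)))
    return bytes(out)
```

A round trip through both functions should give back the original bytes,
which makes the pair easy to check against any existing converter.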
> (One advantage of getting this one fixed is that the Python bindings
> will be able to avoid the current UTF-16 -> UTF-8 -> UTF-16 dance that
> happens now when the user wants Python Unicode strings instead of
> UTF-8; that's a lot of useless transformation that could be saved!)
Also, I think I remember someone writing about a Java wrapper using JNI.
Java is natively UTF-16 too, I believe.
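
On the Python side of the point quoted above, a small sketch using the
standard-library xml.parsers.expat module. It feeds a UTF-16 document to the
parser and collects the character data as Python strings. It only shows what
the caller sees and makes no claim about which internal encoding the
bindings' Expat build uses. collect_text is a hypothetical helper name.

```python
import xml.parsers.expat


def collect_text(xml_bytes: bytes) -> str:
    """Parse a document and return all of its character data as one string."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Character data may arrive in several pieces, so gather and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)


text = "caf\u00e9 \u65e5\u672c\u8a9e"
# The "utf-16" codec writes a byte order mark, which lets the parser
# detect the input encoding without an XML declaration.
utf16_input = ("<doc>" + text + "</doc>").encode("utf-16")
parsed = collect_text(utf16_input)
```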
Karl