[XML-SIG] 2 Qs: encoding & entities with xmlproc

Dan Libby danda@netscape.com
Tue, 08 Jun 1999 03:22:38 -0700

Lars Marius Garshol wrote:

> * Dan Libby
> |
> | 1) The version of xmlproc I have does not appear to support any
> | encoding other than "iso-8859-1".
> This is correct. (Well, US-ASCII will also work, as will anything else
> that is based on US-ASCII as long as you don't try to use funny
> characters in names or name tokens.)
> | [...]  Before, when we were using xmllib, it simply called
> | handle_xml(), where we were able to look at the encoding value and
> | make appropriate decisions at the application level.  Does xmlproc
> | have any equivalent functionality, perhaps in a more recent version?
> The functionality is there, but not used at the moment. If you look at
> the charconv module you'll see that it contains conversion code for
> various encodings as well as a registry object for converters.

Yes, I saw that while I was grepping for something or other and figured it
looked interesting, but was not sure how to plug it in.

> If you want I can easily add the hooks that would let you use this
> functionality. The reason I haven't done this so far is that there
> seemed to be no demand for this functionality.

I would appreciate that.  (Consider this 'demand')  Actually, Jose is more
the demand than I am.   Those crazy i18n guys...  ;-)   If it is a simple
change, perhaps you can just send us a diff or something?
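Until such hooks exist, the application-level decision described above can be approximated before the parser ever sees the data. The following is a minimal sketch, not xmlproc's API: `declared_encoding` is a hypothetical helper that reads the encoding name from the XML declaration, and it assumes the document uses an ASCII-compatible encoding (so not UTF-16) and that UTF-8 is an acceptable default when nothing is declared.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document. Only works for ASCII-compatible encodings.
_ENCODING_RE = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration of `data` (bytes).

    Hypothetical helper: falls back to `default` when no encoding is declared.
    """
    match = _ENCODING_RE.match(data)
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()

doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><doc>caf\xe9</doc>'
encoding = declared_encoding(doc)   # the application can inspect this value
text = doc.decode(encoding)         # and decode before handing text onward
```

The application can then accept, reject, or convert the document based on `encoding`, independent of which parser is used afterwards.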

> | 2) Explanation: I need to preserve XML/HTML entities.  For example,
> | if the document contains &#37; then I want to print that out
> | exactly, not the parsed/converted value.  If I don't do this, then
> | any random person can embed html markup, etc, which could break an
> | HTML page.
> Hmmm. The cleanest solution to this (from an XML/SGML point of view)
> is probably to use string.replace to escape all '<'s in character data
> when it is passed to you from the parser. That would also let you
> retain parser independence and is cleaner in the sense that it becomes
> more obvious what you're really doing.

Yes, that is actually the solution I came up with also.  It doesn't really
seem that clean to me, because if there is a character above 127 that we
want to replace with an entity, it gets funny depending on which encoding
is in use.  In the old model, by contrast, we simply had a map from e.g.
"180" to "&#180;" that we returned to the parser, and similarly from
things like "quot" to "&amp;quot;".

I tried doing this with entity declarations in the DTD and xmlproc, just
for kicks.  It allowed declarations for entity names that start with a
letter, but not for names that start with a digit.  That means that &#60;
would still slip by, even though we could catch &lt;:
<!ENTITY lt "&amp;#60;"> <!-- works ok -->
<!ENTITY 60 "&amp;#60;"> <!-- illegal -->
<!ENTITY #60 "&amp;#60;"> <!-- illegal -->
<!ENTITY "#60" "&amp;#60;"> <!-- illegal -->
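The declarations above are rejected because an entity name has to be an XML Name, which cannot begin with a digit, "#", or a quote. A rough way to check candidate names is sketched below. The pattern is a simplified, ASCII-only approximation of the Name production in the XML 1.0 spec (the real one also admits many non-ASCII letters), and `is_legal_entity_name` is a hypothetical helper, not part of xmlproc.

```python
import re

# Simplified, ASCII-only approximation of the XML 1.0 Name production:
# a letter, "_" or ":" first, then letters, digits, ".", "-", "_" or ":".
_NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")

def is_legal_entity_name(name):
    """Return True if `name` could be declared with <!ENTITY name "...">."""
    return _NAME_RE.fullmatch(name) is not None

for candidate in ("lt", "60", "#60", '"#60"'):
    print(candidate, is_legal_entity_name(candidate))
```

Character references such as &#60; are resolved by the parser itself and never go through the entity table, so no declaration can intercept them.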

> If you don't like the solution above you may want to subclass
> XMLProcessor in xmlproc.py and write your own versions of
> parse_charref and parse_ent_ref.

Yeah... icky.  I like being parser independent.  ;-)

> Instead of rewriting parse_ent_ref you could also just declare the
> entities you need in the DTD, and break into the entity hashtable and
> modify the value of '&lt;'. (I can show you how.)

I think that is what I just mentioned trying above, but maybe you mean
something else?

> If you don't like any of these solutions, let me know, and we'll think
> of something.

Replacing afterwards seems to work ok.  Really we are mostly just
concerned with the "<" and ">".
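A minimal sketch of the replace-afterwards approach, applied to character data as it comes back from whichever parser is in use. `escape_markup` is a hypothetical name. It also escapes "&", which goes slightly beyond the "<" and ">" mentioned above, on the assumption that the output must not contain stray entity references either.

```python
import html

def escape_markup(text):
    """Escape the characters that could inject markup into an HTML page."""
    # "&" must be replaced first, otherwise the "&" in the references
    # inserted for "<" and ">" would be escaped a second time.
    return (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
    )

sample = "<b>bold</b> & more"
print(escape_markup(sample))  # &lt;b&gt;bold&lt;/b&gt; &amp; more

# The standard library offers the same transformation.
stdlib_result = html.escape(sample, quote=False)
```

Because this runs on the parser's output rather than inside the parser, it works the same way with xmllib, xmlproc, or anything else.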

> Also: do you need an option to disallow element and attribute
> declarations in the internal subset?

Sorry, I'm not sure what this means.  What is the internal subset?

BTW, Lars, I saw your name in an XML book my roommate just picked up. I
forget the title, but it listed xmlproc.  Oh, and just now I saw my friend
Jim's name on the python profiling page.  Totally random!

