[XML-SIG] 2 Qs: encoding & entities with xmlproc
Tue, 08 Jun 1999 03:22:38 -0700
Lars Marius Garshol wrote:
> * Dan Libby
> | 1) The version of xmlproc I have does not appear to support any
> | encoding other than "iso-8859-1".
> This is correct. (Well, US-ASCII will also work, as will anything else
> that is based on US-ASCII as long as you don't try to use funny
> characters in names or name tokens.)
> | [...] Before, when we were using xmllib, it simply called
> | handle_xml(), where we were able to look at the encoding value and
> | make appropriate decisions at the application level. Does xmlproc
> | have any equivalent functionality, perhaps in a more recent version?
> The functionality is there, but not used at the moment. If you look at
> the charconv module you'll see that it contains conversion code for
> various encodings as well as a registry object for converters.
Yes, I saw that while I was grepping for something or other and figured it
looked interesting, but was not sure how to plug it in.
> If you want I can easily add the hooks that would let you use this
> functionality. The reason I haven't done this so far is that there
> seemed to be no demand for this functionality.
I would appreciate that. (Consider this 'demand'.) Actually, Jose is more
the demand than I am. Those crazy i18n guys... ;-) If it is a simple
change, perhaps you could just send us a diff or something?
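
For reference, the application-level decision described above (look at the declared encoding, then decide how to decode) might look roughly like the sketch below. It uses only the Python standard library, not xmlproc or its charconv module, and the names declared_encoding and decode_document are made up for illustration. It also assumes the document starts with an ASCII-compatible XML declaration and falls back to UTF-8 when none is present.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
# Assumption: the input is in an ASCII-compatible encoding.
_ENCODING_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration, or `default`."""
    match = _ENCODING_RE.match(data)
    return match.group(1).decode("ascii") if match else default


def decode_document(data):
    """Decode raw document bytes using the declared encoding."""
    name = declared_encoding(data)
    codecs.lookup(name)  # raises LookupError if Python has no such codec
    return data.decode(name)
```

An application could call declared_encoding() first and refuse or special-case encodings it does not want to handle before decoding.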
> | 2) Explanation: I need to preserve XML/HTML entities. For example,
> | if the document contains &#37; then I want to print that out
> | exactly, not the parsed/converted value. If I don't do this, then
> | any random person can embed html markup, etc, which could break an
> | HTML page.
> Hmmm. The cleanest solution to this (from an XML/SGML point of view)
> is probably to use string.replace to escape all '<'s in character data
> when it is passed to you from the parser. That would also let you
> retain parser independence and is cleaner in the sense that it becomes
> more obvious what you're really doing.
Yes, that is actually the solution I came up with also. It doesn't really
seem that clean to me, because if there is a character above 127 that we
want to replace with an entity, it gets funny depending on which encoding
is in use. Whereas in the old model, we simply had a map from e.g. "180" to
"´" that we returned to the parser, and similarly for things like "quot".
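
One way around the encoding dependence, sketched here under the assumption that the character data has already been decoded to a Unicode string: let the codec machinery emit a numeric character reference for everything above 127. The helper name to_ascii_with_charrefs is made up for illustration and this is not part of xmlproc.

```python
def to_ascii_with_charrefs(text):
    """Replace every non-ASCII character with a numeric character reference."""
    # "xmlcharrefreplace" is a standard codec error handler that writes
    # &#NNN; for each character the target encoding cannot represent.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


raw = b"caf\xe9 \xb4"             # bytes as they might arrive in ISO-8859-1
text = raw.decode("iso-8859-1")   # decode once, using the known encoding
print(to_ascii_with_charrefs(text))  # caf&#233; &#180;
```

Because the replacement works on code points after decoding, the same function applies whatever encoding the source document used.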
I tried doing this with entity declarations in the DTD and xmlproc just
for kicks. It allowed alphabetic entity names, but not names starting
with a digit. That means that &#60; would still slip by, even though we
could catch &lt;:
<!ENTITY lt "&#60;"> <!-- works ok -->
<!ENTITY 60 "&#60;"> <!-- illegal -->
<!ENTITY #60 "&#60;"> <!-- illegal -->
<!ENTITY "#60" "&#60;"> <!-- illegal -->
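
The rejections above are consistent with the XML rule that a name must start with a letter, underscore, or colon. The sketch below checks this with Python's bundled expat-based parser rather than xmlproc, so it only illustrates the rule and not xmlproc's behaviour. The entity name copyr and the helper parses are made up for illustration. The declarations sit between the square brackets of the DOCTYPE, which is the internal subset.

```python
import xml.etree.ElementTree as ET

# Entity declared in the internal subset with a legal name.
GOOD = '<!DOCTYPE doc [<!ENTITY copyr "(c) 1999">]><doc>&copyr;</doc>'
# Same shape, but the entity name starts with a digit.
BAD = '<!DOCTYPE doc [<!ENTITY 60 "x">]><doc/>'


def parses(document):
    """Return True if the document is accepted by the parser."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


print(parses(GOOD))  # True
print(parses(BAD))   # False
```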
> If you don't like the solution above you may want to subclass
> XMLProcessor in xmlproc.py and write your own versions of
> parse_charref and parse_ent_ref.
yeah... icky. I like being parser independent. ;-)
> Instead of rewriting parse_ent_ref you could also just declare the
> entities you need in the DTD, and break into the entity hashtable and
> modify the value of '<'. (I can show you how.)
I think that is what I just mentioned trying above, but maybe you mean
> If you don't like any of these solutions, let me know, and we'll think
> of something.
Replacing afterwards seems to work ok. Really we are mostly just
concerned with the "<" and ">".
> Also: do you need an option to disallow element and attribute
> declarations in the internal subset?
Sorry, I'm not sure what this means. What is the internal subset?
BTW, Lars, I saw your name in an XML book my roommate just picked up. I
forget the title, but it listed xmlproc. Oh, and just now I saw my friend
Jim's name on the python profiling page. Totally random!