sgmlop: malformed charrefs?

Thu Mar 17 06:28:29 EST 2005

Magnus Lie Hetland wrote:
> According to The Sgmlop Module Handbook [1], the handle_entityref()
> callback is called for "malformed character entities". What does that
> mean, exactly? What is a malformed character entity? I've tried
> mis-spelling them (e.g., dropping the semicolon), but then they're
> (quite naturally) treated as text/data, with handle_data(). I've tried
> to use a number that is too great, or (equivalently, it turns out) to
> use names instead of numbers, such as &#foo;. In these cases, I only
> get an exception, because the number is too high...
>
> So -- how can I produce a malformed character entity?

with sgmlop 1.1, the following script

import sgmlop

class entity_handler:
    def handle_entityref(self, entityref):
        print "ENTITY", repr(entityref)

parser = sgmlop.XMLParser()
parser.register(entity_handler())
parser.feed("&-10;&/()=?;")

prints:

ENTITY '-10'
ENTITY '/()=?'

> And another thing... For the case where a numeric reference is too
> high (i.e. it can't be translated into a Unicode character) -- is it
> possible to ignore it (or replace it, as with encode/decode)?

if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between &# and ; is passed
to that method.

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.
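
to make the three rules concrete, here is a rough sketch of the fallback order, together with a handle_charref hook that substitutes U+FFFD for references it cannot translate. this only models the behaviour described above and is not sgmlop's own code: dispatch_charref, charref_handler and REPLACEMENT are made-up names, and the sketch assumes that the hook receives the raw text between &# and ; (for example "65", "x41" or "foo"). it is written for a current python, hence chr rather than unichr.

```python
REPLACEMENT = u"\ufffd"  # U+FFFD REPLACEMENT CHARACTER

class charref_handler:
    # hypothetical handler; only the hook name handle_charref is sgmlop's
    def __init__(self):
        self.out = []

    def handle_charref(self, charref):
        # charref is assumed to be the text between "&#" and ";"
        try:
            if charref[:1] in ("x", "X"):
                code = int(charref[1:], 16)   # hexadecimal form, e.g. "x41"
            else:
                code = int(charref)           # decimal form, e.g. "65"
            char = chr(code)
        except (ValueError, OverflowError):
            # not a number, or outside the unicode range: replace it
            char = REPLACEMENT
        self.out.append(char)

def dispatch_charref(handler, charref):
    # models the fallback order described above (an illustration only)
    if hasattr(handler, "handle_charref"):
        # rule 2: pass the part between "&#" and ";"
        handler.handle_charref(charref)
    elif hasattr(handler, "handle_entityref"):
        # rule 3: pass the part between "&" and ";", i.e. including "#"
        handler.handle_entityref("#" + charref)
    # rule 1: no hook at all, so the reference is silently dropped
```

feeding "65", "x41", "99999999999" and "foo" through dispatch_charref with a charref_handler instance leaves "A", "A" and two replacement characters in its out list. a handler that only defines handle_entityref would see "#foo" instead.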

</F>