[XML-SIG] sgmlop and html parsing

Alexandre Fayolle Alexandre.Fayolle at logilab.fr
Wed Jan 14 04:08:58 EST 2004


On Tue, Jan 13, 2004 at 09:31:36PM +0100, "Martin v. Löwis" wrote:
> Alexandre Fayolle wrote:
> 
> >I've looked in the code, and I'm not sure how I can handle this, because
> >encoding issues in drv_sgmlop.py only seem to be handled in the callback
> >methods, and this problem occurs before the callbacks get called. 
> 
> This should happen only if self->unicode is false. This is XML parsing,
> right? If so, you should enable self->unicode, and it will give you
> a unicode character (in handle_data).

This is Netscape bookmark parsing, so the input is not well-formed XML
(lots of tags are not closed). 

demo/xbel/ns_parse.py calls sax2exts.SGMLParserFactory.make_parser(), so
I expect it to return an SGML parser, and not an XML reader. 



> If you want to fix it in sgmlop instead of in the application, you could
> do what the comment suggests: encode the charref as UTF-8, and pass a
> byte string. This is error-prone, though: the application may not expect
> UTF-8.

I'm not too keen on this approach. 
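
To make the concern concrete, here is a minimal sketch of what "encode the
charref as UTF-8 and pass a byte string" could look like, and of how an
application that assumes Latin-1 would misread the result. It is written
for Python 3 and does not use sgmlop at all; the function name
charref_to_utf8 and the assumption that the reference arrives as the text
between "&#" and ";" are hypothetical.

```python
def charref_to_utf8(ref):
    """Turn the body of a numeric character reference into UTF-8 bytes.

    `ref` is assumed to be the text between '&#' and ';',
    e.g. '233' (decimal) or 'xE9' (hexadecimal).
    """
    if ref[:1] in ("x", "X"):
        code = int(ref[1:], 16)
    else:
        code = int(ref)
    return chr(code).encode("utf-8")


# '&#233;' is LATIN SMALL LETTER E WITH ACUTE; UTF-8 needs two bytes for it.
data = charref_to_utf8("233")

# An application that assumes Latin-1 decodes those two bytes as two
# separate characters instead of one.
misread = data.decode("latin-1")
```

Nothing in the byte string says which encoding it uses, so the mismatch
goes unnoticed until the text is displayed.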

 
> As another alternative, in the application, you could activate the 
> handle_charref callback - it is actually considered *before* sgmlop
> tries to deal with the character reference itself.

I'll give this a try, and keep the list posted. Thanks for your quick
answer. 
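
A rough sketch of the handle_charref idea, assuming the callback receives
the body of the reference (e.g. "233" for "&#233;") as a string. That
argument format, the class name CharrefCollector and the text() helper are
assumptions for illustration; the handler is driven by hand here rather
than registered with sgmlop, and it is written for Python 3.

```python
class CharrefCollector:
    """Collect character data, resolving numeric character references."""

    def __init__(self):
        self.chunks = []

    def handle_data(self, data):
        # Ordinary character data is stored unchanged.
        self.chunks.append(data)

    def handle_charref(self, ref):
        # Resolve the reference ourselves so the parser never has to
        # squeeze a non-ASCII character into a byte string.
        try:
            if ref[:1] in ("x", "X"):
                code = int(ref[1:], 16)
            else:
                code = int(ref)
            self.chunks.append(chr(code))
        except (ValueError, OverflowError):
            # Malformed or out-of-range reference: keep it as literal text.
            self.chunks.append("&#" + ref + ";")

    def text(self):
        return "".join(self.chunks)


# Simulate the callbacks a parser might make for 'caf&#233; au lait'.
handler = CharrefCollector()
handler.handle_data("caf")
handler.handle_charref("233")
handler.handle_data(" au lait")
```

Keeping bad references as literal text is only one possible choice; they
could also be dropped or reported as errors.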


-- 
Alexandre Fayolle
LOGILAB, Paris (France).
http://www.logilab.com   http://www.logilab.fr  http://www.logilab.org
Advanced software development - Artificial Intelligence - Training


