[XML-SIG] sgmlop and html parsing
Alexandre Fayolle
Alexandre.Fayolle at logilab.fr
Wed Jan 14 04:08:58 EST 2004
On Tue, Jan 13, 2004 at 09:31:36PM +0100, "Martin v. Löwis" wrote:
> Alexandre Fayolle wrote:
>
> >I've looked in the code, and I'm not sure how I can handle this, because
> >encoding issues in drv_sgmlop.py only seem to be handled in the callback
> >methods, and this problem occurs before the callbacks get called.
>
> This should happen only if self->unicode is false. This is XML parsing,
> right? If so, you should enable self->unicode, and it will give you
> a unicode character (in handle_data).
This is Netscape bookmark parsing, so the input is not well-formed XML
(lots of tags are not closed).
demo/xbel/ns_parse.py calls sax2exts.SGMLParserFactory.make_parser(), so
I expect it to return an SGML parser, and not an XML reader.
> If you want to fix it in sgmlop instead of in the application, you could
> do what the comment suggests: encode the charref as UTF-8, and pass a
> byte string. This is error-prone, though: the application may not expect
> UTF-8.
I'm not too keen on this approach.
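To make the concern concrete, here is a small sketch in present-day Python 3 (not the Python this thread was written against). The helper name charref_to_utf8 is hypothetical and is not part of sgmlop; it only mimics "encode the charref as UTF-8 and pass a byte string", then shows what a consumer assuming Latin-1 would see.

```python
def charref_to_utf8(ref):
    """Turn the body of a numeric character reference into UTF-8 bytes.

    `ref` is the text between '&#' and ';', e.g. '233' or 'xE9'.
    Hypothetical helper, for illustration only.
    """
    if ref[:1] in ("x", "X"):
        codepoint = int(ref[1:], 16)   # hexadecimal form, &#xE9;
    else:
        codepoint = int(ref, 10)       # decimal form, &#233;
    return chr(codepoint).encode("utf-8")


raw = charref_to_utf8("233")        # bytes produced for &#233;
as_utf8 = raw.decode("utf-8")       # an application that expects UTF-8
as_latin1 = raw.decode("latin-1")   # an application that expects Latin-1
```

The UTF-8 consumer gets back a single character, while the Latin-1 consumer silently gets two wrong ones, which is the kind of mismatch I would rather not push onto applications.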
> As another alternative, in the application, you could activate the
> handle_charref callback - it is actually considered *before* sgmlop
> tries to deal with the character reference itself.
I'll give this a try, and keep the list posted. Thanks for your quick
answer.
--
Alexandre Fayolle
LOGILAB, Paris (France).
http://www.logilab.com http://www.logilab.fr http://www.logilab.org
Advanced software development - Artificial Intelligence - Training