[XML-SIG] sgmlop and html parsing
Thomas B. Passin
tpassin at comcast.net
Wed Jan 14 09:26:17 EST 2004
Alexandre Fayolle wrote:
>>This should happen only if self->unicode is false. This is XML parsing,
>>right? If so, you should enable self->unicode, and it will give you
>>a unicode character (in handle_data).
> This is netscape bookmark parsing, so this is not well formed XML (lots
> of tags are not closed).
> demo/xbel/ns_parse.py calls sax2exts.SGMLParserFactory.make_parser(), so
> I expect it to return an SGML parser, and not an XML reader.
I took a different approach. To parse Netscape bookmark files, I just
take the default parser and handle the encoding with a few patches in
the downstream code. (I have found that setting the encoding to utf-8
works reliably in Mozilla-derived browsers on Windows 2000.)
Here is the relevant part of my modification to ns_parse.py -
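import codecs

ENCODING = 'utf-8'
# codecs.lookup() returns (encoder, decoder, stream reader, stream writer)
(encoder, decoder, reader, writer) = codecs.lookup(ENCODING)
the_parser = sax2exts.SGMLParserFactory.make_parser()
file = open(sys.argv[1], 'r')  # bookmark file named on the command line
bms = ns_handler.bms
out = writer(sys.stdout)  # encode everything written to stdout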
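The stream-writer wrapping above can be tried in isolation. What follows is only a minimal sketch, assuming Python 3: an in-memory `io.BytesIO` buffer stands in for `sys.stdout`, and `make_encoded_writer` is a hypothetical helper name that does not appear in ns_parse.py.

```python
import codecs
import io

ENCODING = "utf-8"


def make_encoded_writer(byte_stream, encoding=ENCODING):
    """Wrap a binary stream so that text written to it is encoded on the way out.

    Hypothetical helper for illustration; not part of ns_parse.py.
    """
    writer_factory = codecs.getwriter(encoding)  # stream writer class for the codec
    return writer_factory(byte_stream)


# Assumption: a BytesIO buffer stands in for the real output stream.
buffer = io.BytesIO()
out = make_encoded_writer(buffer)
out.write("<title>Caf\u00e9 links</title>\n")
encoded = buffer.getvalue()  # the bytes that would reach the output
```

The serializer keeps writing ordinary text, and the choice of encoding lives in one place.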
You need to pass an ENCODING along so that the eventual serializer can
put it into the xml declaration. Here, with utf-8, the declaration
could be omitted, but I seem to need to use iso-8859-1 for IE, so it is
Oh, yes, I escape characters in the serializer, too. These fixes need
only small changes to the existing code, but they are really needed.
if self.id:
    ID = ' id="%s"' % self.id
else:
    ID = ""
# encoding is the ENCODING value passed along from ns_parse.py
out.write('<?xml version="1.0" encoding="%s"?>\n<xbel%s>\n' %
          (encoding, ID))
title_str = " <title>%s</title>\n" % escape(self.title)
# ... etc.
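The two serializer changes, putting the encoding into the XML declaration and escaping text content, can be sketched on their own. This assumes Python 3 and the standard `xml.sax.saxutils` module; `xbel_header` and `title_element` are hypothetical function names for illustration, not the names used in the actual xbel demo code.

```python
from xml.sax.saxutils import escape, quoteattr


def xbel_header(encoding, xbel_id=None):
    """Build the XML declaration and opening <xbel> tag (hypothetical helper)."""
    # quoteattr() escapes the value and wraps it in quotes for use as an attribute.
    id_attr = (" id=%s" % quoteattr(xbel_id)) if xbel_id else ""
    return '<?xml version="1.0" encoding="%s"?>\n<xbel%s>\n' % (encoding, id_attr)


def title_element(title):
    """Build a <title> element, escaping &, < and > in the text."""
    return " <title>%s</title>\n" % escape(title)


header = xbel_header("iso-8859-1", "main")
title_str = title_element("Q&A <tips>")
```

Without the `escape()` call, a bookmark title containing `&` or `<` would produce output that an XML parser rejects.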
In a year, I have not had one failure with my Mozilla, Firebird, and IE
bookmarks. After serializing to XBEL format, I run the files through
several xslt stylesheets, so any encoding problems would surface at that
point. Before I added encoding and escaping to the code, I was being
driven nuts by encoding problems.
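A downstream XML tool acts as a check here because it refuses input that is not well formed. A rough sketch of the same kind of check, assuming Python 3 and the standard `xml.etree.ElementTree` parser in place of an XSLT processor; `serialize` and `is_well_formed` are hypothetical names, and the tiny document is made up for illustration:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def serialize(title, encoding):
    """Produce a tiny XBEL-like document as bytes (illustrative only)."""
    text = ('<?xml version="1.0" encoding="%s"?>\n'
            '<xbel>\n <title>%s</title>\n</xbel>\n' % (encoding, escape(title)))
    return text.encode(encoding)  # bytes match the declared encoding


def is_well_formed(data):
    """Return True if an XML parser accepts the bytes, False otherwise."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


good = serialize("Caf\u00e9 & <bar>", "iso-8859-1")
bad = b"<xbel><title>A & B</title></xbel>"  # unescaped ampersand
```

Here `is_well_formed(good)` passes because the declaration, the byte encoding, and the escaping all agree, while `bad` fails on the raw `&`.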