[XML-SIG] sgmlop and html parsing
Thomas B. Passin
tpassin at comcast.net
Wed Jan 14 09:26:17 EST 2004
Alexandre Fayolle wrote:
>>
>>This should happen only if self->unicode is false. This is XML parsing,
>>right? If so, you should enable self->unicode, and it will give you
>>a unicode character (in handle_data).
>
>
> This is netscape bookmark parsing, so this is not well formed XML (lots
> of tags are not closed).
>
> demo/xbel/ns_parse.py calls sax2exts.SGMLParserFactory.make_parser(), so
> I expect it to return an SGML parser, and not an XML reader.
I took a different approach. To parse Netscape bookmark files, I just
take the default parser and handle the encoding downstream, with a few
small patches to the downstream code. (I have found that setting the
encoding to utf-8 works reliably in Mozilla-derived browsers on
Windows 2000.)
Here is the relevant part of my modification to ns_parse.py -
import codecs

ENCODING = 'utf-8'
(encoder,decoder,reader,writer) = codecs.lookup(ENCODING)

ns_handler=NetscapeHandler()
the_parser = sax2exts.SGMLParserFactory.make_parser()
the_parser.setContentHandler(ns_handler)
the_parser.setProperty(handler.property_encoding, ENCODING)

file = open(sys.argv[1], 'r')
the_parser.parse(file)
bms = ns_handler.bms

if len(sys.argv)==3:
    out=writer(open(sys.argv[2],"w"))
    bms.dump_xbel(out,ENCODING)
    out.close()
else:
    out = writer(sys.stdout)
    bms.dump_xbel(out,ENCODING)
You need to pass ENCODING along so that the eventual serializer can
put it into the XML declaration. With utf-8 the encoding declaration
could be omitted, but I seem to need iso-8859-1 for IE, and there it
is required.
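For anyone trying the same idea on a current Python 3, here is a minimal sketch of the "wrap the output in a codec writer and state the encoding in the declaration" step. It is not the code from ns_parse.py; the helper name open_xml_writer is made up for illustration, and an in-memory buffer stands in for the output file.

```python
import codecs
import io


def open_xml_writer(binary_stream, encoding):
    """Wrap a binary stream in a codec writer and emit the XML declaration.

    Hypothetical helper, not part of ns_parse.py.
    """
    # codecs.getwriter returns a StreamWriter class for the named codec;
    # text written through it is encoded before it reaches the stream.
    writer = codecs.getwriter(encoding)(binary_stream)
    # The declaration names the same encoding the writer actually uses.
    writer.write('<?xml version="1.0" encoding="%s"?>\n' % encoding)
    return writer


buf = io.BytesIO()  # stands in for open(path, "wb")
out = open_xml_writer(buf, "iso-8859-1")
out.write("<title>caf\u00e9</title>\n")
result = buf.getvalue()
```

The point of passing one encoding value to both places is that the declared encoding and the bytes on disk cannot drift apart.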
Oh, yes, I escape characters in the serializer, too. These changes only
require small changes to the existing code, but they are really needed.
Example -
def dump_xbel(self,out,encoding):
    if self.id:
        ID = ' id="%s"' % self.id
    else:
        ID = ""
    out.write('<?xml version="1.0" encoding="%s"?>\n<xbel%s>\n' %
              (encoding,ID))
    if self.title:
        title_str = "  <title>%s</title>\n" % escape(self.title)
        out.write(title_str)
    # ... etc.
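As a rough illustration of the escaping step, the sketch below uses escape and quoteattr from the standard library's xml.sax.saxutils. I am assuming that is a reasonable stand-in for the escape used above. The two function names are hypothetical, and quoting the id attribute is an extra precaution that the excerpt above does not take.

```python
from xml.sax.saxutils import escape, quoteattr


def title_element(title):
    """Return a <title> line with &, < and > in the text escaped."""
    return "  <title>%s</title>\n" % escape(title)


def xbel_open_tag(id_value=None):
    """Return the opening <xbel> tag, quoting the id attribute if present."""
    if id_value:
        # quoteattr escapes the value and adds the surrounding quotes itself.
        return "<xbel id=%s>\n" % quoteattr(id_value)
    return "<xbel>\n"


line = title_element("Q&A <archive>")
tag = xbel_open_tag('a"b')
```

Bookmark titles quite often contain a bare ampersand, which is enough to make the output ill-formed if it is written unescaped.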
In a year, I have not had one failure with my Mozilla, Firebird, and IE
bookmarks. After serializing to XBEL format, I run the files through
several XSLT stylesheets, so any encoding problems would surface at that
point. Before I added encoding and escaping to the code, I was being
driven nuts by encoding problems.
Cheers,
Tom P