[XML-SIG] sgmlop and html parsing

Thomas B. Passin tpassin at comcast.net
Wed Jan 14 09:26:17 EST 2004


Alexandre Fayolle wrote:
>>
>>This should happen only if self->unicode is false. This is XML parsing,
>>right? If so, you should enable self->unicode, and it will give you
>>a unicode character (in handle_data).
> 
> 
> This is netscape bookmark parsing, so this is not well formed XML (lots
> of tags are not closed). 
> 
> demo/xbel/ns_parse.py calls sax2exts.SGMLParserFactory.make_parser(), so
> I expect it to return an SGML parser, and not an XML reader. 

I took a different approach.  To parse Netscape bookmark files, I just 
take the default parser and handle the encoding with a few patches in 
the downstream code.  (I have found that setting the encoding to utf-8 
works reliably in Mozilla-derived browsers on Windows 2000.)

Here is the relevant part of my modification to ns_parse.py -

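     import codecs

     ENCODING = 'utf-8'
     # codecs.lookup() yields (encoder, decoder, stream reader, stream writer)
     (encoder,decoder,reader,writer) = codecs.lookup(ENCODING)

     ns_handler=NetscapeHandler()
     the_parser = sax2exts.SGMLParserFactory.make_parser()
     the_parser.setContentHandler(ns_handler)
     # name the encoding to assume for the input data
     the_parser.setProperty(handler.property_encoding, ENCODING)

     # 'infile' rather than 'file', which would shadow the builtin
     infile = open(sys.argv[1], 'r')
     the_parser.parse(infile)
     infile.close()
     bms = ns_handler.bms

     # wrap the output stream so that text is encoded as it is written
     if len(sys.argv)==3:
         out=writer(open(sys.argv[2],"w"))
         bms.dump_xbel(out,ENCODING)
         out.close()
     else:
         out = writer(sys.stdout)
         bms.dump_xbel(out,ENCODING)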

You need to pass ENCODING along so that the eventual serializer can 
put it into the XML declaration.  With utf-8 the declaration could be 
omitted, but I seem to need iso-8859-1 for IE, and in that case the 
declaration is required.
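
A minimal sketch of why the declared encoding and the writer's encoding 
have to agree.  It is written for present-day Python 3 rather than the 
Python of this post, and the helper name write_declared and the sample 
title are made up for illustration; codecs.getwriter and io.BytesIO are 
standard library.

```python
import codecs
import io

def write_declared(text, encoding):
    """Encode text behind an XML declaration that names the same encoding.

    Hypothetical helper, not part of ns_parse.py.
    """
    raw = io.BytesIO()
    # codecs.getwriter() returns the same stream-writer class that
    # codecs.lookup() exposes as its fourth element
    out = codecs.getwriter(encoding)(raw)
    out.write('<?xml version="1.0" encoding="%s"?>\n' % encoding)
    out.write(text)
    return raw.getvalue()

sample = "<title>Caf\u00e9</title>\n"   # made-up bookmark title
utf8_bytes = write_declared(sample, "utf-8")
latin1_bytes = write_declared(sample, "iso-8859-1")
```

The same character comes out as different bytes in the two results, so 
a reader that is not told the encoding (and falls back to utf-8) would 
misread the iso-8859-1 output.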

Oh, yes, I escape characters in the serializer, too.  These fixes need 
only small changes to the existing code, but they are really needed. 
  Example -

     def dump_xbel(self,out,encoding):
         if self.id:
             ID = ' id="%s"' % self.id
         else:
             ID = ""
         out.write('<?xml version="1.0" encoding="%s"?>\n<xbel%s>\n'
                   % (encoding,ID))

         if self.title:
             title_str = "  <title>%s</title>\n" % escape(self.title)
             out.write(title_str)

         # ... etc.
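
A small sketch of what the escaping step does, assuming escape here is 
xml.sax.saxutils.escape (the post does not show the import).  The sample 
values are made up, and quoteattr is shown only as one way to treat an 
attribute value the same way; it is not used in the code above.

```python
from xml.sax.saxutils import escape, quoteattr

title = "Tips & Tricks <draft>"   # made-up bookmark title
bookmark_id = "b&m"               # made-up id value

# escape() replaces &, < and > so the text is safe as element content
title_str = "  <title>%s</title>\n" % escape(title)

# quoteattr() escapes the value and adds the surrounding quotes
xbel_open = "<xbel id=%s>\n" % quoteattr(bookmark_id)
```

Without the escape() call, the raw & and < in the title would make the 
output ill-formed, and a later XSLT stage would reject it.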

In a year, I have not had one failure with my Mozilla, Firebird, and IE 
bookmarks.  After serializing to XBEL format, I run the files through 
several xslt stylesheets, so any encoding problems would surface at that 
point.  Before I added encoding and escaping to the code, I was being 
driven nuts by encoding problems.

Cheers,

Tom P



