Unicode and RDF

Wed Mar 10 00:41:30 EST 2004

I'm trying to parse the RDF dumps from dmoz.org (the Open Directory
Project) and am having great difficulty just getting Python to read
the files.  According to the dmoz.org web site, the files are RDF
encoded as UTF-8, but I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position
52376-52378: invalid data

Here's a short script that reproduces the problem:

import sys
import codecs
from xml.sax import make_parser, handler

def main():
    # Open the dump named on the command line, decoding it as UTF-8.
    f = codecs.open(sys.argv[1], 'r', 'utf-8')
    parser = make_parser()
    parser.setContentHandler(dmoz())
    parser.parse(f)

class dmoz(handler.ContentHandler):
    # Print the name of every element the parser encounters.
    def startElement(self, name, attrs):
        print(name)

if __name__ == '__main__':
    main()

I'm working with the dump from February 23rd, 2004.  The RDF dump news
page on dmoz.org has an entry dated March 3rd, 2003 stating that the
data is filtered to "prevent UTF-8 and XML character encoding
problems", so I am assuming that the UTF-8 files I have are valid.  I
run into the problem with both the structure.rdf.u8 file and the
content.rdf.u8 file.
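
One way to test that assumption, independent of the XML parser, would
be to scan the raw bytes directly.  The following is only a minimal
sketch: find_invalid_utf8 and its context parameter are names made up
for illustration, not part of any library.  It reads the file in binary
mode one line at a time (a newline byte never occurs inside a multi-byte
UTF-8 sequence, so splitting on lines should be safe) and reports where
the first undecodable bytes sit.

```python
def find_invalid_utf8(path, context=16):
    """Locate the first byte sequence in `path` that is not valid UTF-8.

    Returns (byte_offset, line_number, nearby_bytes), or None if every
    line decodes cleanly.  `context` (a hypothetical parameter) is how
    many bytes on either side of the bad sequence to include.
    """
    offset = 0  # byte offset of the start of the current line
    with open(path, 'rb') as f:
        for line_no, line in enumerate(f, 1):
            try:
                line.decode('utf-8')
            except UnicodeDecodeError as err:
                # err.start and err.end are relative to this line.
                start = max(err.start - context, 0)
                nearby = line[start:err.end + context]
                return offset + err.start, line_no, nearby
            offset += len(line)
    return None
```

Calling find_invalid_utf8('structure.rdf.u8') and printing the result
should show the offending bytes, if there are any, which would at least
tell me whether the problem is in the data or in my code.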

What am I doing wrong?

-Richard

dmoz.org RDF dumps: http://rdf.dmoz.org/

dmoz.org RDF news: http://rdf.dmoz.org/rdf/Changes.html