Unicode and rdf

Neil Turner N.R.Turner at bradford.ac.uk
Sat Mar 20 22:20:03 CET 2004

"A.M. Kuchling" <amk at amk.ca> wrote:
> Oh dear.   
> Around 2001/2002 I worked on Python code for processing dmoz dumps, but gave
> up because the data was so bad -- some categories included content in
> various Chinese encodings despite the file's claim to be UTF-8.  I
> eventually gave up because debugging a program that fails after running for
> six hours is really, really tedious.

I'm a senior editor at the ODP, so allow me to (attempt to) explain
the situation. Originally the ODP used a different character encoding
for each language - so while most of the directory used ISO-8859-1 or
Windows-1252, the Japanese sections used Shift_JIS, and so on. This
wasn't ideal, so we started down the long road towards Unicode. Many
of the non-English sections were cloned and converted to Unicode in a
testbed area, and the Unicode versions were then merged back into the
main directory - this merge took place only a few weeks ago.
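
As a rough illustration of what such a conversion involves, here is a
minimal Python sketch that re-encodes bytes from a known legacy
encoding into UTF-8. The function name to_utf8 and the sample strings
are made up for the example; this is not the ODP's actual tooling, and
it assumes you already know which encoding each section used.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode as UTF-8."""
    # decode() raises UnicodeDecodeError if the bytes do not fit the encoding.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical sample data, one string per legacy encoding.
japanese = "\u65e5\u672c\u8a9e".encode("shift_jis")  # Japanese text as Shift_JIS
western = "caf\u00e9".encode("cp1252")               # "cafe" with e-acute as Windows-1252

# Printing the bytes objects keeps the output ASCII-only on any console.
print(to_utf8(japanese, "shift_jis"))
print(to_utf8(western, "cp1252"))
```

The hard part in practice is the assumption above: if a section is
labelled with the wrong encoding, the decode step either fails or
silently produces the wrong characters.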

A few weeks back, the entire site was switched over to Unicode - now,
if you enter text in any non-Unicode encoding, those characters will
show as ?. Naturally, in a directory containing over 4 million entries
some wrongly encoded text will still be in there, but the aim is to
eradicate it in due course. I'll admit it isn't ideal though.
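
For anyone processing the dumps in the meantime, one possible approach
is to check each line for valid UTF-8 up front and report the
offenders, rather than have the program fall over hours into a run.
The sketch below is only an assumption about how that might look: the
function name find_bad_lines and the three-line sample dump are
invented for the example, and it treats the dump as line-oriented
bytes.

```python
import io


def find_bad_lines(stream):
    """Yield (line_number, repaired_text) for lines that are not valid UTF-8."""
    for number, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            # Invalid bytes become U+FFFD, the Unicode replacement character.
            yield number, raw.decode("utf-8", errors="replace")


# Hypothetical dump: line 2 holds Shift_JIS bytes inside a file that
# is supposed to be UTF-8 throughout.
dump = io.BytesIO(
    b"<Title>Caf\xc3\xa9</Title>\n"
    + b"<Title>" + "\u65e5\u672c\u8a9e".encode("shift_jis") + b"</Title>\n"
    + b"<Title>plain ASCII</Title>\n"
)

for number, text in find_bad_lines(dump):
    # ascii() keeps the report printable on any console.
    print(number, ascii(text))
```

This only tells you which lines are wrong, not which encoding they
were originally in; recovering the intended text still needs a guess
or some knowledge of the category the line came from.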

Neil Turner

More information about the Python-list mailing list