Unicode and rdf

Neil Turner N.R.Turner at bradford.ac.uk
Sat Mar 20 16:20:03 EST 2004


"A.M. Kuchling" <amk at amk.ca> wrote in message 
> Oh dear.   
> 
> Around 2001/2002 I worked on Python code for processing dmoz dumps, but gave
> up because the data was so bad -- some categories included content in
> various Chinese encodings despite the file's claim to be UTF-8.  I
> eventually gave up because debugging a program that fails after running for
> six hours is really, really tedious.

I'm a senior editor at the ODP, so allow me to (attempt to) explain
the situation. Originally the ODP used different character encodings
for different languages, so while most of the directory used
ISO-8859-1 or Windows-1252, Japanese would use Shift_JIS, and so on.
This wasn't ideal, so we started down the long road towards
Unicode. Many of the non-English sections were cloned and converted to
Unicode in a testbed area, then the Unicode versions were merged back
into the main directory. This merge took place only a few
weeks ago.

A few weeks back, the entire site was switched over to Unicode. Now,
if you enter any text in a non-Unicode encoding, the affected
characters will show as ?. Naturally, with a directory containing over
4 million entries, some text in the old encodings will still be in
there, but the aim is to eradicate it in due course. I'll admit it
isn't ideal, though.
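
For anyone processing the dumps in the meantime, here is a minimal
Python sketch of one way to cope with stray legacy-encoded lines
without the whole run dying hours in. It is only an illustration, not
how the ODP itself handles things: the function name decode_line, the
choice and order of fallback encodings, and the "utf-8-replaced" label
are all assumptions of mine. Guessing an encoding this way is a
heuristic and can mis-decode text, so the encoding used is returned
alongside the text and suspicious lines can be logged for review.

```python
def decode_line(raw, fallbacks=("shift_jis", "cp1252")):
    """Decode one line of bytes, returning (text, encoding_used).

    Hypothetical helper: tries strict UTF-8 first, then each fallback
    encoding in order, and finally UTF-8 with replacement characters.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    for enc in fallbacks:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep going, but mark the damage with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8-replaced"


def scan(lines):
    """Yield (line_number, text, encoding_used) for an iterable of byte lines."""
    for number, raw in enumerate(lines, start=1):
        text, enc = decode_line(raw)
        yield number, text, enc
```

Opening the dump in binary mode (open(path, "rb")) and passing the file
object to scan() gives a record of which lines were not valid UTF-8,
so the bad entries can be listed up front rather than discovered one
crash at a time.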

--
Neil Turner
http://dmoz.org/profiles/totalxsive.html
http://www.neilturner.me.uk/


