[Tutor] HTML encoding of character sets...

Wed May 3 17:16:45 CEST 2006

Hi,

I need to do some encoding of text that will be used in a web page.
The text has been translated into 16 different languages.
I've managed the manual encoding of some of the more straightforward 
languages (French, Spanish, Italian, etc.) by
replacing characters like 'á' with the numeric entity &#225;.
This works when you only have a few characters like this in the text and 
they visually stand out.
However, I now have to move on to other languages like Arabic, Russian, 
Chinese, Hebrew, Japanese, Korean, Hindi and Polish.
In these languages, the sheer volume of characters that need to be 
encoded is huge.
For instance, the following text is a title on a web page (in Russian), 
asking the user to wait for the page to load:

Èäåò çàãðóçêà, ïîæàëóéñòà, ïîäîæäèòå…

It obviously looks like garbage unless you have your email reader set to 
a Russian text encoding.
But even if it appears correctly, there are far too many characters for 
me to encode numerically by hand.

Does anyone know how I can automate this process?
I want to be able to read a string out of a translation file, pass it to 
a Python script and get back a list or string of numeric entities
that I can then bury in my HTML.
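
To illustrate the kind of conversion being asked for, here is a minimal sketch. It is written for modern Python 3 rather than the Python 2.4 used elsewhere in this message, and the function name to_numeric_entities is hypothetical. It relies on the standard 'xmlcharrefreplace' error handler, which replaces every character that cannot be encoded as ASCII with its decimal numeric entity.

```python
def to_numeric_entities(text):
    """Return `text` with every non-ASCII character replaced by &#NNN;.

    Hypothetical helper for illustration. `text` must already be a
    Unicode string (str in Python 3), not raw bytes.
    """
    # Encoding to ASCII fails for non-ASCII characters; the
    # 'xmlcharrefreplace' handler substitutes a numeric entity instead.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


print(to_numeric_entities('á'))       # &#225;
print(to_numeric_entities('café'))    # caf&#233;
```

Plain ASCII characters pass through unchanged, so the result can be pasted into HTML regardless of the page's declared character set.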

I had a play with a snippet of code from the Unicode chapter of 'Dive 
Into Python' (http://diveintopython.org/xml_processing/unicode.html)
but I get the following error:

text = open('russian.txt', 'r').read()
converted_text = text.encode('koi8-r')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "c:\Python24\lib\encodings\koi8_r.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc8 in position 0: ordinal not in range(128)
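
One possible reading of this traceback, offered as a sketch and not a confirmed diagnosis: in Python 2, calling .encode() on a byte string first decodes it implicitly as ASCII, which fails on byte 0xc8. Decoding the bytes explicitly with the file's real encoding avoids that step. The example below is Python 3, the helper name bytes_to_entities is hypothetical, and the choice of 'cp1251' (windows-1251) is an assumption based on how the title above renders, not something stated in this message.

```python
def bytes_to_entities(raw, source_encoding):
    """Decode raw bytes explicitly, then emit decimal numeric entities.

    Hypothetical helper for illustration only.
    """
    # Step 1: bytes -> Unicode text, naming the encoding explicitly.
    text = raw.decode(source_encoding)
    # Step 2: Unicode text -> ASCII, with non-ASCII characters turned
    # into &#NNN; by the standard 'xmlcharrefreplace' error handler.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


# The first four characters of the title above, taken as Latin-1 bytes.
# 0xc8 is the byte named in the traceback.
sample = b'\xc8\xe4\xe5\xf2'

# ASSUMPTION: the source file is windows-1251, not KOI8-R.
print(bytes_to_entities(sample, 'cp1251'))  # &#1048;&#1076;&#1077;&#1090;
```

For a real file, the bytes would come from something like open('russian.txt', 'rb').read(), with the encoding argument set to whatever the translators actually used.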

Anybody got any ideas?

Many thanks,
Frank.