[Tutor] Entity to UTF-8

Ezequiel, Justin j.ezequiel@spitech.com
Tue Apr 29 23:53:05 2003


Greetings.

I need to convert entities (&#945;) in an XML file to the actual UTF-8 characters (α).
Currently, I am using this bit of code (a prototype to see if I can actually do it).
This seems to work just fine, but I was wondering if there are other ways of doing this.

##--------------------------------------------------
import codecs
import re

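# codecs.lookup returns (encoder, decoder, stream reader, stream writer)
(utf8_encode, utf8_decode, utf8_reader, utf8_writer) = codecs.lookup("utf-8")
# decimal character references only (&#945;); hex ones (&#x3B1;) are left alone
patt = r'&#(\d+);'

def ToUTF8(matchobj):
    # turn the matched reference into the Unicode character it names
    return unichr(long(matchobj.group(1)))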

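def GetUTF8(pth):
    # read the whole file and decode it from UTF-8 into a unicode string
    infile = utf8_reader(open(pth))
    readstr = infile.read()
    infile.close()
    return readstr

def WriteUTF8(pth, text):
    # encode the unicode string as UTF-8 while writing it out
    outf = utf8_writer(open(pth, 'w'))
    outf.write(text)
    outf.close()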

ustr = GetUTF8('input.htm')

ustr = re.sub(patt, ToUTF8, ustr)

WriteUTF8('output.htm', ustr)
##--------------------------------------------------
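
One other way of doing this, as a minimal sketch only: it assumes a current Python 3 interpreter, where unichr is spelled chr and the built-in open() takes an encoding argument, so the codecs.lookup unpacking is not needed. The names DECIMAL_REF, replace_decimal_refs and convert_file are made up for this illustration.

```python
import re

# Decimal numeric character references only, e.g. &#945;
DECIMAL_REF = re.compile(r'&#(\d+);')

def replace_decimal_refs(text):
    """Replace each decimal reference with the character it names."""
    return DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)

def convert_file(src, dst):
    """Read src as UTF-8, expand the references, write dst as UTF-8."""
    with open(src, encoding='utf-8') as infile:
        text = infile.read()
    with open(dst, 'w', encoding='utf-8') as outfile:
        outfile.write(replace_decimal_refs(text))
```

With the file names from the prototype, the call would be convert_file('input.htm', 'output.htm'). Hex references such as &#x3B1; do not match the pattern and pass through unchanged.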

sample input file (actual production files would be XML):
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<TITLE>TESTING</TITLE>
</HEAD>
<BODY>
<P>&#65279; &#1103; &#1078; &#1097; &#1102; &#1092; &#1081; &#1073; &#8936; &#8995; &#62; &#9742; &#945;</P>
<P>&#65279; &#1103; &#1078; &#1097; &#1102; &#1092; &#1081; &#1073; &#8936; &#8995; &#62; &#9742; &#945;</P>
</BODY>
</HTML>

sample output file:
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<TITLE>TESTING</TITLE>
</HEAD>
<BODY>
<P>﻿ я ж щ ю ф й б ⋨ ⌣ > ☎ α</P>
<P>﻿ я ж щ ю ф й б ⋨ ⌣ > ☎ α</P>
</BODY>
</HTML>

Can you point me to any resources or tutorials for this?
Is there a HowTo for the codecs module?
Maybe there are other modules I should look at (XML?).

Actual (production) input files would most likely have &alpha; instead of &#945; but &#x3B1; is also possible.
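
A rough sketch of how all three forms might be handled in one pass, assuming Python 3 and assuming the named entities are the HTML 4 ones listed in html.entities.name2codepoint (the module is called htmlentitydefs in Python 2). An XML file with its own DTD could define other names, which this would leave untouched. The names ENTITY, KEEP_ESCAPED, expand_entity and expand_entities are invented for the example. Unlike the prototype, it deliberately leaves references to markup characters such as &#62; and &lt; alone, on the assumption that the output should stay well-formed.

```python
import re
from html.entities import name2codepoint

# Matches &#945; (decimal), &#x3B1; (hex) and &alpha; (named)
ENTITY = re.compile(r'&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);')

# Characters that act as markup in XML; references to them stay escaped
KEEP_ESCAPED = set('<>&"\'')

def expand_entity(match):
    body = match.group(1)
    if body[:2] in ('#x', '#X'):
        codepoint = int(body[2:], 16)      # hex reference
    elif body.startswith('#'):
        codepoint = int(body[1:])          # decimal reference
    else:
        codepoint = name2codepoint.get(body)
        if codepoint is None:
            return match.group(0)          # unknown name: leave unchanged
    if codepoint > 0x10FFFF:
        return match.group(0)              # not a valid Unicode code point
    char = chr(codepoint)
    if char in KEEP_ESCAPED:
        return match.group(0)              # keep markup characters escaped
    return char

def expand_entities(text):
    """Expand decimal, hex and known named entities in text."""
    return ENTITY.sub(expand_entity, text)
```

The resulting string can then be written out as UTF-8 in the same way as in the prototype.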

BTW, is there a built-in function to convert a hex string ('3B1') to a long (945)?
I am currently using my own function (am too embarrassed to post it here).
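
For the hex conversion, the built-in int() (and long() in Python 2) accepts the base as a second argument. A short sketch, written for Python 3, with variable names chosen only for illustration:

```python
# int() parses a string in the given base; 16 means hexadecimal
codepoint = int('3B1', 16)
assert codepoint == 945

# A leading 0x prefix is also accepted when the base is 16
assert int('0x3B1', 16) == 945

# chr() (unichr() in Python 2) gives the character for that code point
character = chr(codepoint)   # Greek small letter alpha
```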