[Tutor] Entity to UTF-8
Ezequiel, Justin
j.ezequiel@spitech.com
Tue Apr 29 23:53:05 2003
Greetings.
I need to convert entities (&#945;) in an XML file to the actual UTF-8 characters (α).
Currently, I am using this bit of code (prototype to see if I can actually do it).
This seems to work just fine but I was wondering if there are other ways of doing this.
##--------------------------------------------------
import codecs
import re
(utf8_encode, utf8_decode, utf8_reader, utf8_writer) = codecs.lookup("utf-8")
patt = '&#([^;]+);'

def ToUTF8(matchobj):
    return unichr(long(matchobj.group(1)))

def GetUTF8(pth):
    infile = utf8_reader(open(pth))
    readstr = infile.read()
    infile.close()
    return readstr

def WriteUTF8(pth, str):
    outf = utf8_writer(open(pth, 'w'))
    outf.write(str)
    outf.close()

ustr = GetUTF8('input.htm')
ustr = re.sub(patt, ToUTF8, ustr)
WriteUTF8('output.htm', ustr)
##--------------------------------------------------
sample input file (actual production files would be XML):
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<TITLE>TESTING</TITLE>
</HEAD>
<BODY>
<P> я ж щ ю ф й б ⋨ ⌣ > ☎ α</P>
<P> я ж щ ю ф й б ⋨ ⌣ > ☎ α</P>
</BODY>
</HTML>
sample output file:
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<TITLE>TESTING</TITLE>
</HEAD>
<BODY>
<P>? ? ? ? ? ? ? ? ? ? > ? ?</P>
<P>? ? ? ? ? ? ? ? ? ? > ? ?</P>
</BODY>
</HTML>
Can you point me to resources or tutorials, if any, for this?
Is there a HowTo for the codecs module?
Maybe there are other modules I should look at (XML?).
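On the "other ways" question, one possibility is to let the file object do the UTF-8 decoding and encoding, instead of wrapping it with the reader and writer returned by codecs.lookup. The sketch below assumes Python 3 syntax, where the built-in open() accepts an encoding argument. The names read_utf8 and write_utf8 are made up for illustration and are not part of any module.

```python
def read_utf8(path):
    # Decode the file as UTF-8 while reading; the result is a Unicode string.
    with open(path, encoding='utf-8') as infile:
        return infile.read()


def write_utf8(path, text):
    # Encode the Unicode string as UTF-8 while writing.
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(text)
```

The with statement closes the file even if reading or writing raises an exception, which the explicit close() calls in the prototype would not do.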
Actual (production) input files would most likely have &#x3B1; instead of &#945;, but &alpha; is also possible.
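Handling decimal (&#945;), hexadecimal (&#x3B1;), and named (&alpha;) references in one pass might look like the sketch below. It is only a sketch under a few assumptions: Python 3 syntax, the HTML entity names from the standard html.entities table (a production XML file could declare its own entities in a DTD, which this does not read), and hypothetical names (ENTITY_RE, KEEP_ESCAPED, replace_entity, expand_entities). References that would expand to markup characters such as < or & are deliberately left alone so the XML stays well-formed.

```python
import re
from html.entities import name2codepoint

# Characters that must stay escaped, or the XML would no longer parse.
KEEP_ESCAPED = set('<>&"\'')

# Matches &#945; (decimal), &#x3B1; (hex), and &alpha; (named).
ENTITY_RE = re.compile(r'&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);')


def replace_entity(match):
    body = match.group(1)
    if body[:2] in ('#x', '#X'):
        codepoint = int(body[2:], 16)      # hexadecimal reference
    elif body.startswith('#'):
        codepoint = int(body[1:])          # decimal reference
    else:
        codepoint = name2codepoint.get(body)
        if codepoint is None:
            return match.group(0)          # unknown name: leave as-is
    try:
        char = chr(codepoint)
    except (ValueError, OverflowError):
        return match.group(0)              # not a valid code point
    if char in KEEP_ESCAPED:
        return match.group(0)              # keep markup characters escaped
    return char


def expand_entities(text):
    return ENTITY_RE.sub(replace_entity, text)
```

For example, expand_entities('&#945; &#x3B1; &alpha; &lt;') gives 'α α α &lt;'.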
BTW, is there a built-in method to convert a Hex string ('3B1') to a long (945)?
I am currently using my own function (am too embarrassed to post it here).
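For the hex question, the built-in int() accepts a base as its second argument, so a hand-written converter should not be necessary. A minimal sketch (hex_to_int is just an illustrative wrapper name):

```python
def hex_to_int(hex_digits):
    # int() with base 16 parses hexadecimal digits, upper or lower case.
    return int(hex_digits, 16)


print(hex_to_int('3B1'))  # prints 945
```

In the Python 2 prototype above, long('3B1', 16) takes the same base argument.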