[Tutor] Entity to UTF-8
Paul Tremblay
phthenry@earthlink.net
Wed Apr 30 20:01:16 2003
You probably know this already, but I thought I'd offer it
anyway.
Your code has the lines:
patt = '&#([^;]+);'
ustr = re.sub(patt, ToUTF8, ustr)
I believe this is inefficient, because Python has to compile the regular
expression each time. This code should be more efficient:
patt = re.compile(r'&#([^;]+);')
ustr = re.sub(patt, ToUTF8, ustr)
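Here is a minimal sketch of the precompiled version, assuming Python 3,
where chr and int take the place of the unichr and long used in the quoted
code. The name to_char is only my stand-in for ToUTF8, and the sample
string is made up. The parentheses in the pattern matter: the replacement
function reads the digits through group(1).

```python
import re

# Compile once; the parentheses capture the digits between "&#" and ";".
patt = re.compile(r'&#([0-9]+);')

def to_char(matchobj):
    # Turn the captured decimal code point into the character itself.
    return chr(int(matchobj.group(1)))

ustr = 'alpha: &#945;, ya: &#1103;'
ustr = patt.sub(to_char, ustr)
print(ustr)
```

I narrowed the character class to [0-9]+ here so that anything that is not
a decimal number is left alone instead of raising an error in int.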
I am struggling with unicode myself, so I am going to test out your code
and see if it helps me.
Paul
On Wed, Apr 30, 2003 at 11:53:10AM +0800, Ezequiel, Justin wrote:
> From: "Ezequiel, Justin" <j.ezequiel@spitech.com>
> To: "'tutor@python. org' (E-mail)" <tutor@python.org>
> Subject: [Tutor] Entity to UTF-8
> Date: Wed, 30 Apr 2003 11:53:10 +0800
>
> Greetings.
>
> I need to convert entities (&#945;) in an XML file to the actual UTF-8 characters (α).
> Currently, I am using this bit of code (prototype to see if I can actually do it).
> This seems to work just fine but I was wondering if there are other ways of doing this.
>
> ##--------------------------------------------------
> import codecs
> import re
>
> (utf8_encode, utf8_decode, utf8_reader, utf8_writer) = codecs.lookup("utf-8")
> patt = '&#([^;]+);'
>
> def ToUTF8(matchobj):
>     return unichr(long(matchobj.group(1)))
>
> def GetUTF8(pth):
>     infile = utf8_reader(open(pth))
>     readstr = infile.read()
>     infile.close()
>     return readstr
>
> def WriteUTF8(pth, str):
>     outf = utf8_writer(open(pth, 'w'))
>     outf.write(str)
>     outf.close()
>
> ustr = GetUTF8('input.htm')
>
> ustr = re.sub(patt, ToUTF8, ustr)
>
> WriteUTF8('output.htm', ustr)
> ##--------------------------------------------------
>
> sample input file (actual production files would be XML):
> <HTML>
> <HEAD>
> <META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
> <TITLE>TESTING</TITLE>
> </HEAD>
> <BODY>
> <P> я ж щ ю ф й б ⋨ ⌣ > ☎ α</P>
> <P> я ж щ ю ф й б ⋨ ⌣ > ☎ α</P>
> </BODY>
> </HTML>
>
> sample output file:
> <HTML>
> <HEAD>
> <META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
> <TITLE>TESTING</TITLE>
> </HEAD>
> <BODY>
> <P>? ? ? ? ? ? ? ? ? ? > ? ?</P>
> <P>? ? ? ? ? ? ? ? ? ? > ? ?</P>
> </BODY>
> </HTML>
>
> Can you point me to resources/tutorials if any for this?
> Is there a HowTo for the codecs module?
> Maybe there are other modules I should look at (XML?).
>
> Actual (production) input files would most likely have &#x3B1; instead of &#945; but &#945; is also possible.
>
> BTW, is there a built-in method to convert a Hex string ('3B1') to a long (945)?
> I am currently using my own function (am too embarrassed to post it here).
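On the hex question in the quoted message: the built-in int accepts a base
as its second argument (long does the same in Python 2), so a hand-written
converter should not be needed. Below is a rough sketch, assuming Python 3,
that uses this to handle both the &#x3B1; and the &#945; form in one pass.
The names CHAR_REF and decode_char_ref are my own, not from the original
code, and the sample text is made up.

```python
import re

# int() with an explicit base converts a hex string to a number.
assert int('3B1', 16) == 945

# Match hexadecimal (&#x3B1;) and decimal (&#945;) character references.
CHAR_REF = re.compile(r'&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));')

def decode_char_ref(matchobj):
    # Exactly one of the two groups is set, depending on the form matched.
    hex_digits, dec_digits = matchobj.groups()
    if hex_digits is not None:
        return chr(int(hex_digits, 16))
    return chr(int(dec_digits))

text = '<P>&#x3B1; &#945; &gt;</P>'
decoded = CHAR_REF.sub(decode_char_ref, text)
print(decoded)
```

Named entities such as &gt; are left untouched. One caveat: a numeric
reference to a markup character (&#60; or &#38;, say) would be turned into
a literal < or &, so for real XML those code points may need to be skipped.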
--
************************
*Paul Tremblay *
*phthenry@earthlink.net*
************************