[Baypiggies] Handling unwanted Unicode \u2019 characters in XML

Wed Jul 2 07:34:03 CEST 2008

On Tue, Jul 1, 2008 at 3:36 PM, Stephen McInerney
<spmcinerney at hotmail.com> wrote:
> Here's one for the XML people,
>
> I am using XML imported from FrameMaker, which contains the unwanted Unicode
> character '\u2019' (the character started out as a plain apostrophe in the
> source Frame document).
> This seems to be a common issue with many word processors (MS, Frame etc.)
> that use the curly left- and right-leaning apostrophes. I see many references
> to this issue on the web.
>
> You can't always print Unicode strings as is: when the output encoding is
> ASCII, a non-ASCII character raises a UnicodeEncodeError, so you must
> encode them first.
> But \u2019 has no ASCII encoding, and its UTF-8 encoding is not very
> human-readable or useful:
> >>> u'\u2019'.encode('utf-8')
> '\xe2\x80\x99'
>
> Hence I thought I should do a find and replace with a regex to map the
> unwanted \u2019 back to a plain old apostrophe.
> (You can do Unicode regexes with re.compile(<pattern>, re.UNICODE).)
>
> But then I thought:
> In the interest of preventing exceptions by making sure all Unicode
> characters are either mapped to ASCII or removed, it seems like I really
> want a Unicode version of string.maketrans() and string.translate(),
> which are deprecated.
> Can anyone tell me what the equivalent is for Unicode strings?

That reminds me of this:

latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871
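
On the translate() question specifically: unicode objects have their own
translate() method, which takes a dict keyed by code point instead of a
maketrans() table. Here is a minimal sketch, assuming you only care about a
handful of typographic characters. The table contents and the
to_plain_ascii name are made up for illustration and are not taken from the
recipe above:

```python
# Hypothetical table: code point -> ASCII replacement text.
# Extend it with whatever characters your FrameMaker export produces.
SMART_PUNCTUATION = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
    0x2013: u"-",   # en dash
    0x2014: u"--",  # em dash
    0x00A0: u" ",   # no-break space
}


def to_plain_ascii(text):
    """Map known typographic characters to ASCII, then drop the rest."""
    replaced = text.translate(SMART_PUNCTUATION)
    # Anything still outside ASCII is silently removed by 'ignore'.
    return replaced.encode("ascii", "ignore").decode("ascii")


print(to_plain_ascii(u"It\u2019s a \u201cwalled\u201d garden"))
```

Note that the final 'ignore' step throws information away (accented
letters simply vanish), so it is only appropriate if losing those
characters is acceptable for your output.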

Now is probably as good a time as any to learn about Unicode.  Here's
an easy start:
http://wiki.pylonshq.com/display/pylonsdocs/Unicode
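
As a small taste of what that covers, encode() accepts an error handler
that decides what happens to characters the target encoding cannot
represent. A rough sketch (the ascii_variants helper is a made-up name);
the 'xmlcharrefreplace' handler may be the most interesting one here, since
the data is XML anyway:

```python
def ascii_variants(text):
    """Encode text to ASCII with three of the standard error handlers."""
    return {
        # Substitute '?' for each unencodable character.
        "replace": text.encode("ascii", "replace"),
        # Drop unencodable characters entirely.
        "ignore": text.encode("ascii", "ignore"),
        # Emit a numeric XML character reference such as &#8217;.
        "xmlcharrefreplace": text.encode("ascii", "xmlcharrefreplace"),
    }


variants = ascii_variants(u"it\u2019s")
for name in sorted(variants):
    print(name)
    print(variants[name])
```

With the default handler ('strict'), the same encode() call raises
UnicodeEncodeError, which is the exception described in the question.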

Happy Hacking!
-jj

-- 
It's a walled garden, but the flowers sure are lovely!
http://jjinux.blogspot.com/