[Baypiggies] Handling unwanted Unicode \u2019 characters in XML
Shannon -jj Behrens
jjinux at gmail.com
Wed Jul 2 07:34:03 CEST 2008
On Tue, Jul 1, 2008 at 3:36 PM, Stephen McInerney
<spmcinerney at hotmail.com> wrote:
> Here's one for the XML people,
> I am using XML imported from FrameMaker, which contains the unwanted Unicode
> character '\u2019' (the character started out as a plain apostrophe in the
> source Frame document.)
> It seems this is a common issue with many word-processors (MS, Frame etc.)
> using the funky right- and left-leaning apostrophes. I see many references
> to this issue on the web.
> You can't print Unicode strings as is; it causes an exception, so you must
> encode them (to ASCII).
> But the ASCII encoding of \u2019 is not very human-readable or useful:
> Hence I thought I should do a find-and-replace with a regex to map the
> unwanted \u2019 back to plain old apostrophe.
> (You can do Unicode regexes with re.compile(<pattern>, re.UNICODE))
> But then I thought:
> In the interest of preventing exceptions by making sure all Unicode
> characters are either mapped to ASCII
> or removed, it seems like I really want a Unicode version of
> string.maketrans() and string.translate(), which is deprecated.
> Can anyone tell me what the equivalent is for Unicode functions?
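On the translate() part of the question: as far as I know, Unicode strings have their own translate() method that takes a dict keyed by code point, so no maketrans() is needed. Here is a minimal sketch in Python 3 syntax (Python 2's unicode.translate() accepts the same kind of mapping). The table contents and the names PUNCT_TO_ASCII and asciify_punctuation are mine, made up for illustration, and the table is nowhere near complete:

```python
# Hypothetical mapping table: keys are Unicode code points, values are the
# replacement text (a value of None would delete the character instead).
PUNCT_TO_ASCII = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark (the one from FrameMaker)
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2013: "-",    # en dash
    0x2014: "--",   # em dash
    0x2026: "...",  # horizontal ellipsis
    0x00AD: None,   # soft hyphen: drop it entirely
}


def asciify_punctuation(text):
    """Replace common typographic punctuation with plain ASCII equivalents."""
    return text.translate(PUNCT_TO_ASCII)


print(asciify_punctuation("It\u2019s a \u201cwalled garden\u201d"))
```

Characters that are not in the table pass through untouched, so this only prevents encoding errors for the code points you have listed.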
That reminds me of this:
latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"
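For a rough idea of what a "hammer" looks like, here is a small sketch of my own. It is not the code from that recipe, just one common way to force text down to ASCII, and the name to_ascii_lossy is made up. It decomposes accented characters with unicodedata.normalize() and then throws away whatever still isn't ASCII:

```python
import unicodedata


def to_ascii_lossy(text):
    """Force text down to ASCII, silently dropping what cannot be mapped."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # "ignore" discards every character that has no ASCII encoding,
    # including the combining marks produced by the step above.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_lossy("caf\u00e9"))   # the accent is stripped
print(to_ascii_lossy("It\u2019s"))   # the curly apostrophe is dropped
```

Note that \u2019 has no decomposition, so this approach deletes it rather than turning it into a plain apostrophe. If you care about keeping the apostrophe, map it explicitly before applying the hammer.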
Now is probably as good a time as any to learn about Unicode. Here's
an easy start:
It's a walled garden, but the flowers sure are lovely!