[Baypiggies] Handling unwanted Unicode \u2019 characters in XML
cvrebert at gmail.com
Wed Jul 2 00:48:42 CEST 2008
On Tue, Jul 1, 2008 at 3:36 PM, Stephen McInerney
<spmcinerney at hotmail.com> wrote:
> Here's one for the XML people,
> I am using XML imported from FrameMaker, which contains the unwanted Unicode
> character '\u2019' (the character started out as a plain apostrophe in the
> source Frame document.)
> It seems this is a common issue with many word-processors (MS, Frame etc.)
> using the funky right- and left- leaning apostrophes. I see many references
> to this issue on the web.
> You can't print Unicode strings as is; it causes an exception. You must
> encode them (to ASCII).
> But the ASCII encoding of \u2019 is not very human-readable or useful:
That's UTF-8, not ASCII (there's a big difference), and you're seeing
the repr() of the encoded string, which is of course an ugly escape sequence.
If instead you print the encoded string, you get:
>>> print u'\u2019'.encode('utf-8')
’
Which is perfectly sensible. Same for other unicode chars.
Are you really sure you need this to be ASCII and not UTF-8? If so,
why do you need it to be true ASCII?
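
To make the UTF-8 versus ASCII distinction concrete, here is a minimal sketch (the variable names are made up for illustration). It is written so it should run on both Python 2 and Python 3. It shows that U+2019 encodes to three bytes in UTF-8, that a strict ASCII encode fails, and that the 'replace' error handler substitutes a question mark.

```python
s = u'\u2019'  # RIGHT SINGLE QUOTATION MARK

# UTF-8 represents this character as three bytes; repr() shows them
# as escapes, which is the "ugly" form mentioned above.
utf8_bytes = s.encode('utf-8')
print(repr(utf8_bytes))

# ASCII has no code for U+2019, so a strict encode raises an exception.
try:
    s.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# With the 'replace' error handler the character becomes '?'.
replaced = s.encode('ascii', 'replace')
```

Decoding utf8_bytes with 'utf-8' gives back the original character, so nothing is lost by keeping the data as UTF-8.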
> Hence I thought I should do a find or replace with a regex to map the
> unwanted \u2019 back to plain old apostrophe.
> (You can do Unicode regexes with re.compile(<pattern>, re.UNICODE))
> But then I thought:
> In the interest of preventing exceptions by making sure all Unicode
> characters are either mapped to ASCII
> or removed, it seems like I really want a Unicode version of
> string.maketrans() and string.translate(), which is deprecated.
> Can anyone tell me what that equivalent is, for Unicode fns?
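
One possible answer, as a sketch rather than the only way to do it: the translate() method of Unicode strings (unicode.translate() in Python 2, str.translate() in Python 3) accepts a dict mapping code points to replacement strings, which plays the role of string.maketrans() for Unicode. The names PUNCT_MAP and to_ascii_text below are hypothetical, and the table only covers a few common "smart" quotes; extend it for whatever FrameMaker actually emits.

```python
import unicodedata

# Hypothetical mapping table: code point -> plain ASCII replacement.
PUNCT_MAP = {
    0x2018: u"'",  # LEFT SINGLE QUOTATION MARK
    0x2019: u"'",  # RIGHT SINGLE QUOTATION MARK
    0x201C: u'"',  # LEFT DOUBLE QUOTATION MARK
    0x201D: u'"',  # RIGHT DOUBLE QUOTATION MARK
}

def to_ascii_text(text):
    """Map known punctuation to ASCII, then drop whatever is left over."""
    # Step 1: explicit replacements from the table.
    text = text.translate(PUNCT_MAP)
    # Step 2: decompose accented letters into base letter + combining mark.
    text = unicodedata.normalize('NFKD', text)
    # Step 3: discard anything that still is not ASCII.
    return text.encode('ascii', 'ignore').decode('ascii')

print(to_ascii_text(u'It\u2019s a \u201ccaf\u00e9\u201d'))
```

Note that step 3 silently removes characters that have no mapping, so this only makes sense if losing them is acceptable.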