[Baypiggies] Handling unwanted Unicode \u2019 characters in XML

Chris Rebert cvrebert at gmail.com
Wed Jul 2 00:48:42 CEST 2008


On Tue, Jul 1, 2008 at 3:36 PM, Stephen McInerney
<spmcinerney at hotmail.com> wrote:
> Here's one for the XML people,
>
> I am using XML imported from FrameMaker, which contains the unwanted Unicode
> character '\u2019' (the character started out as a plain apostrophe in the
> source Frame document.)
> It seems this is a common issue with many word-processors (MS, Frame etc.)
> using the funky right- and left- leaning apostrophes. I see many references
> to this issue on the web.
>
> You can't print Unicode strings as is, it causes an exception, you must
> encode them (to ASCII).
> But the ASCII encoding of \u2019 is not very human-readable or useful:
> >>> u'\u2019'.encode('utf-8')
> '\xe2\x80\x99'

That's UTF-8, not ASCII (there's a big difference), and you're seeing
the repr() of the encoded string, which is of course an ugly escape
sequence.
If instead you print the encoded string, you get:

>>> print u'\u2019'.encode('utf-8')
’

Which is perfectly sensible. The same goes for other Unicode characters.
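
To make the repr()-versus-print distinction concrete, here is a small
sketch. The variable names are mine, chosen only for illustration, and
the snippet should run on Python 2 as well as Python 3.3 or later:

```python
text = u'\u2019'                 # RIGHT SINGLE QUOTATION MARK
encoded = text.encode('utf-8')   # three bytes: E2 80 99

# repr() escapes the non-ASCII bytes, which is what the interactive
# prompt echoes back; the data itself is not damaged.
escaped = repr(encoded)
print(escaped)

# Decoding the same bytes as UTF-8 returns the original character.
roundtrip = encoded.decode('utf-8')
print(roundtrip == text)
```

The escape sequence is only a display form of the three UTF-8 bytes; a
UTF-8 terminal or XML parser reading those bytes sees the curly
apostrophe.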

Are you really sure you need this to be ASCII and not UTF-8? If so,
why do you need it to be true ASCII?
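
If true ASCII does turn out to be a requirement, the standard codec
error handlers are worth a look before reaching for a regex. A rough
sketch follows; the sample string is made up, and which handler is
appropriate depends on what consumes the output:

```python
text = u'It\u2019s'

# The default ("strict") handler raises instead of guessing.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Lossy handlers: substitute a question mark, or drop the character.
replaced = text.encode('ascii', 'replace')   # It?s
ignored = text.encode('ascii', 'ignore')     # Its

# For XML output, a numeric character reference keeps the character
# recoverable while the bytes stay pure ASCII.
as_charref = text.encode('ascii', 'xmlcharrefreplace')   # It&#8217;s
```

Since the data is XML, 'xmlcharrefreplace' may be the least destructive
choice: any XML parser turns &#8217; back into U+2019.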

- Chris

>
> Hence I thought I should do a find or replace with a regex to map the
> unwanted \u2019 back to plain old apostrophe.
> (You can do Unicode regexes with re.compile(<pattern>, re.UNICODE))
>
> But then I thought:
> In the interest of preventing exceptions by making sure all Unicode
> characters are either mapped to ASCII
> or removed, it seems like I really want a Unicode version of
> string.maketrans() and string.translate(), which is deprecated.
> Can anyone tell me what that equivalent is, for Unicode fns?
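
One possible answer, as a sketch rather than a definitive recipe:
Unicode strings have their own translate() method, which takes a dict
keyed by code point (an integer) instead of the 256-character table
built by string.maketrans(). The table contents and the helper name
to_ascii below are illustrative assumptions, not a complete mapping:

```python
# Map code points to ASCII replacements. These entries are examples
# only; extend the table for whatever the source documents contain.
PUNCTUATION_TO_ASCII = {
    0x2018: u"'",    # LEFT SINGLE QUOTATION MARK
    0x2019: u"'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: u'"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: u'"',    # RIGHT DOUBLE QUOTATION MARK
    0x2013: u'-',    # EN DASH
}

def to_ascii(text):
    """Replace known punctuation, then drop anything still non-ASCII."""
    return text.translate(PUNCTUATION_TO_ASCII).encode('ascii', 'ignore')

cleaned = to_ascii(u'Frame\u2019s \u201csmart\u201d quotes')
```

Mapping a code point to None in the table deletes that character, so
the same dict can cover both the "map to ASCII" and the "remove" cases.
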
>
> Thanks,
> Stephen

