[Baypiggies] Handling unwanted Unicode \u2019 characters in XML

Chad Netzer chad.netzer at gmail.com
Fri Jul 4 02:28:02 CEST 2008


On Thu, Jul 3, 2008 at 2:08 PM, Stephen McInerney
<spmcinerney at hotmail.com> wrote:
> Chad,
>
> I'm on Solaris 10.

>> You need to execute all the statements. I'm having difficulty
>> understanding how the unicode literal U+2019 can map to U+00E2 like you
>> say.
>
> You're making a wrong assumption that 'â' must mean U+00E2, it's just some
> non-7-bit character which the shell objects to and mangles.

Ah, it all makes sense to me now.  Your terminal is using latin-1
encoding, and when you explicitly encode the character u'\u2019' to
utf-8, you get the three-byte string '\xe2\x80\x99', the first byte of
which is circumflex 'a' in latin-1.

The weird part is that in your first message, when you printed
u'\u2019'.encode('utf-8'), you said you got the circumflex 'a'
character (latin-1 0xE2), but your latest message indicates you
sometimes get circumflex 'a' followed by two more characters (Euro,
and Trademark), which makes more sense.  Hmmm... Those look like they
are actually Windows-1252 character values (0x80 and 0x99):

http://en.wikipedia.org/wiki/Windows-1252
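
To make the byte-level story concrete, here is a small sketch (my own
illustration, not taken from your session; the variable names are made
up) of what the three utf-8 bytes turn into when they are misread as
latin-1 or Windows-1252.  I believe it behaves the same on Python 2
and Python 3:

```python
# RIGHT SINGLE QUOTATION MARK, encoded the same way as in the thread.
quote = u'\u2019'
utf8_bytes = quote.encode('utf-8')   # the three bytes 0xE2 0x80 0x99

# Read as latin-1, the first byte alone is LATIN SMALL LETTER A WITH
# CIRCUMFLEX (U+00E2); 0x80 and 0x99 are unprintable control codes there.
first_as_latin1 = utf8_bytes[:1].decode('latin-1')

# Read as Windows-1252, all three bytes are printable:
# a-circumflex, EURO SIGN (U+20AC), TRADE MARK SIGN (U+2122).
as_cp1252 = utf8_bytes.decode('cp1252')
```

So seeing only the circumflex 'a' is consistent with a strict latin-1
display, and seeing all three characters is consistent with
Windows-1252.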

In any case, it sounds like you need a more voracious "Unicode
HAMMER", which would convert the unicode RIGHT SINGLE QUOTATION MARK
into an ASCII APOSTROPHE (among other translational abominations), but a
simple unicode replace() might work.

i.e.
>>> a = u'\u2019'
>>> a
u'\u2019'
>>> a.replace(u'\u2019', u'\u0027')
u"'"   # Uhhh, that's a single apostrophe in there...

Obviously the above could be done more intelligently by matching left
quotations, etc., but it's a quick and dirty kludge for now.
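
For what it's worth, a slightly less dirty version might use
translate() with a small mapping instead of chained replace() calls.
This is only a sketch under my own assumptions: the table name, the
function name, and the choice of which four characters to include are
all mine, and real data may well need more entries:

```python
# Hypothetical table: Unicode code point -> ASCII stand-in.
SMART_PUNCTUATION = {
    0x2018: u"'",   # LEFT SINGLE QUOTATION MARK
    0x2019: u"'",   # RIGHT SINGLE QUOTATION MARK
    0x201C: u'"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: u'"',   # RIGHT DOUBLE QUOTATION MARK
}

def asciify_quotes(text):
    """Return text with the curly quotes in the table made straight.

    Expects a unicode string; characters not in the table are kept.
    """
    return text.translate(SMART_PUNCTUATION)
```

You would call it on the decoded text before writing the XML back
out, e.g. asciify_quotes(u'It\u2019s') gives u"It's".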

C

