[Baypiggies] Handling unwanted Unicode \u2019 characters in XML

Thu Jul 3 23:08:37 CEST 2008

Chad,

I'm on Solaris 10. My replies are below, but it would be faster for you to call me.
[to everyone else who sent suggestions like latin1_to_ascii -- The UNICODE Hammer,
I'm reading them too. I'll send out a rollup when I finally figure out the best approach
for my context.]

> You need to execute all the statements.  I'm having difficulty
> understanding how the unicode literal U+2019 can map to U+00E2 like you say.

You're wrongly assuming that 'â' must mean U+00E2. It's just some
non-7-bit character that the shell objects to and mangles.
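
Here is a minimal sketch of one way a 'â' can appear without U+00E2 ever being in the data. It assumes the display is interpreting UTF-8 bytes as Latin-1 or Windows-1252, which is a guess about the terminal involved, not something confirmed here. The variable names are only illustrative.

```python
# U+2019 (RIGHT SINGLE QUOTATION MARK) is three bytes in UTF-8: E2 80 99.
right_quote = u'\u2019'
utf8_bytes = right_quote.encode('utf-8')

# Read as Latin-1, the first byte 0xE2 alone looks like U+00E2 ('â').
first_byte_as_latin1 = utf8_bytes[:1].decode('latin-1')

# Read as Windows-1252, all three bytes become U+00E2, U+20AC, U+2122.
mangled = utf8_bytes.decode('cp1252')

print(repr(utf8_bytes))
print(first_byte_as_latin1 == u'\u00e2')
print(mangled == u'\u00e2\u20ac\u2122')
```

So a leading 'â' on screen is consistent with the first byte of the UTF-8 encoding of U+2019 being shown in a single-byte charset.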

> Execute all these statements with cut-n-paste and give us the results:
> 
>>> a = u'\u2019'
>>> b = u'\u00E2'
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 0: ordinal not in range(256)
>>> print b
â
>>> print a.encode('utf-8')
â€™
>>> print b.encode('utf-8')
Ã¢
>>> ord(a)
8217
>>> ord(b)
226
>>> unichr(ord(a))
u'\u2019'
>>> unichr(ord(b))
u'\xe2'
>>> import sys
>>> sys.maxunicode
65535
>>> sys.byteorder
'big'

> It might be something trivial that I'm overlooking...  Also, you
> mentioned an exception when trying to print the literal?  I assume it
> was a UnicodeEncodeError?  I'd like to see what it was, in any case.

Yes, it was the usual culprit that thousands are plagued by:
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 53: ordinal not in range(256)
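
For reference, a small sketch of how that error arises and two standard ways to sidestep it when the output has to be Latin-1. The sample string is made up for illustration, and which error handler suits an XML pipeline is an assumption to check against your own context.

```python
# Hypothetical sample text containing U+2019.
text = u'It\u2019s'

# Strict encoding (the default) fails, because U+2019 is outside Latin-1.
try:
    text.encode('latin-1')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'xmlcharrefreplace' swaps unencodable characters for numeric references,
# which an XML parser will turn back into the original character.
as_xml_refs = text.encode('latin-1', 'xmlcharrefreplace')

# 'replace' substitutes a '?' instead, which loses information.
as_placeholder = text.encode('latin-1', 'replace')

print(strict_ok)
print(repr(as_xml_refs))
print(repr(as_placeholder))
```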

Regards,
Stephen
