[Baypiggies] Handling unwanted Unicode \u2019 characters in XML
Stephen McInerney
spmcinerney at hotmail.com
Thu Jul 3 23:08:37 CEST 2008
Chad,
I'm on Solaris 10. My replies are below, but it would be faster for you to call me.
[to everyone else who sent suggestions like latin1_to_ascii -- The UNICODE Hammer,
I'm reading them too. I'll send out a rollup when I finally figure out the best approach
for my context.]
> You need to execute all the statements. I'm having difficulty
> understanding how the unicode literal U+2019 can map to U+00E2 like you say.
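You're wrongly assuming that 'â' must mean U+00E2. It's just some
non-7-bit character that the shell objects to and mangles. The UTF-8
encoding of U+2019 is the three bytes E2 80 99, and a latin-1 terminal
shows the first byte, 0xE2, as 'â'.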
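To make the byte-level picture concrete, here is a minimal sketch. It is written to run on Python 3 as well as 2.6+, and the variable names are illustrative only. It shows that the 'â' comes from the first UTF-8 byte of U+2019 being read as latin-1, not from U+00E2 itself:

```python
# U+2019 is RIGHT SINGLE QUOTATION MARK (the "curly apostrophe").
right_quote = u'\u2019'

# Its UTF-8 encoding is three bytes long.
utf8_bytes = right_quote.encode('utf-8')
hex_bytes = [hex(b) for b in bytearray(utf8_bytes)]  # ['0xe2', '0x80', '0x99']

# A latin-1 terminal interprets each byte as one character, so the
# first byte 0xE2 is displayed as 'â'. The other two bytes are C1
# control characters, which usually render as nothing or as garbage.
misread = utf8_bytes.decode('latin-1')
first_char = misread[0]  # u'\xe2', displayed as 'â'
```

So 'â' on screen only says that some byte 0xE2 reached the terminal, which is consistent with `print a.encode('utf-8')` and `print b` looking alike.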
> Execute all these statements with cut-n-paste and give us the results:
>
>>> a = u'\u2019'
>>> b = u'\u00E2'
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 0: ordinal not in range(256)
>>> print b
â
>>> print a.encode('utf-8')
â
>>> print b.encode('utf-8')
â
>>> ord(a)
8217
>>> ord(b)
226
>>> unichr(ord(a))
u'\u2019'
>>> unichr(ord(b))
u'\xe2'
>>> import sys
>>> sys.maxunicode
65535
>>> sys.byteorder
'big'
> It might be something trivial that I'm overlooking... Also, you
> mentioned an exception when trying to print the literal? I assume it
> was a UnicodeEncodeError? I'd like to see what it was, in any case.
Yes, it was the usual culprit that thousands are plagued by:
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 53: ordinal not in range(256)
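For reference, a hedged sketch of three common ways around this error, using only the standard `encode` error handlers. The sample string and variable names are made up for illustration, and which option fits depends on what the XML consumer expects:

```python
# Hypothetical sample text containing the curly apostrophe U+2019.
text = u'It\u2019s'

# Strict encoding (the default) fails, because latin-1 has no U+2019.
try:
    text.encode('latin-1')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Option 1: emit an XML numeric character reference instead.
as_charref = text.encode('latin-1', 'xmlcharrefreplace')  # b'It&#8217;s'

# Option 2: substitute a question mark (lossy).
as_question = text.encode('latin-1', 'replace')  # b'It?s'

# Option 3: map the curly apostrophe to a plain ASCII one first.
as_ascii_quote = text.replace(u'\u2019', u"'").encode('latin-1')  # b"It's"
```

Option 1 keeps the original character recoverable by any XML parser, while options 2 and 3 change the text.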
Regards,
Stephen