[Baypiggies] Handling unwanted Unicode \u2019 characters in XML
spmcinerney at hotmail.com
Thu Jul 3 23:08:37 CEST 2008
I'm on Solaris 10. My replies are below, but it would be faster if you called me.
[to everyone else who sent suggestions like latin1_to_ascii -- The UNICODE Hammer,
I'm reading them too. I'll send out a rollup when I finally figure out the best approach
for my context.]
> You need to execute all the statements. I'm having difficulty
> understanding how the unicode literal U+2019 can map to U+00E2 like you say.
You're wrongly assuming that 'â' must mean U+00E2; it's just some
non-7-bit character that the shell objects to and mangles.
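One plausible way an 'â' can show up where U+2019 was expected (an assumption on my part, nothing in this thread confirms it is what the shell does here): the UTF-8 encoding of U+2019 is the three bytes E2 80 99, and anything that reads those bytes as Latin-1 displays the first one as 'â'. A minimal sketch, with variable names of my own choosing, that runs on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
# Sketch only: shows how U+2019 can *look* like U+00E2 without ever being it.

quote = u'\u2019'                        # RIGHT SINGLE QUOTATION MARK

# UTF-8 encodes U+2019 as three bytes: 0xE2 0x80 0x99.
utf8_bytes = quote.encode('utf-8')

# Hypothetical misreading: treat each UTF-8 byte as one Latin-1 character.
misread = utf8_bytes.decode('latin-1')

print([hex(ord(ch)) for ch in misread])  # ['0xe2', '0x80', '0x99']

# The first misread character is U+00E2 ('â'); the other two are
# non-printing C1 control characters, so only the 'â' is visible.
print(misread[0] == u'\u00e2')           # True
```

So the string itself still holds U+2019; only the display path turns it into something that starts with 'â'.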
> Execute all these statements with cut-n-paste and give us the results:
>>> a = u'\u2019'
>>> b = u'\u00E2'
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 0: ordinal not in range(256)
>>> print b
>>> print a.encode('utf-8')
>>> print b.encode('utf-8')
>>> import sys
> It might be something trivial that I'm overlooking... Also, you
> mentioned an exception when trying to print the literal? I assume it
> was a UnicodeEncodeError? I'd like to see what it was, in any case.
Yes, it was the usual culprit that thousands are plagued by:
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 53: ordinal not in range(256)
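For anyone following along, here is a rough sketch of how that error arises and three standard ways around it. The sample string and variable names are made up for illustration; the error handlers ('xmlcharrefreplace', 'replace') are the stock ones in Python's codecs. Which one suits an XML pipeline depends on what the consumer expects, so treat this as options rather than a recommendation. It uses `except ... as`, so it assumes Python 2.6 or later (it also runs on Python 3):

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduce the latin-1 failure, then sidestep it three ways.

text = u'It\u2019s'          # hypothetical sample containing U+2019

# Strict Latin-1 encoding fails because U+2019 is above U+00FF.
try:
    text.encode('latin-1')
except UnicodeEncodeError as exc:
    print('strict encoding failed: %s' % exc)

# Option 1: emit an XML numeric character reference (8217 == 0x2019).
as_charref = text.encode('latin-1', 'xmlcharrefreplace')

# Option 2: substitute '?' for anything Latin-1 cannot represent (lossy).
as_placeholder = text.encode('latin-1', 'replace')

# Option 3: encode as UTF-8, which can represent every code point.
as_utf8 = text.encode('utf-8')

print(repr(as_charref))
print(repr(as_placeholder))
print(repr(as_utf8))
```

Option 1 keeps the character recoverable by any XML parser while the output stays pure ASCII; option 2 throws information away; option 3 only helps if everything downstream really reads UTF-8.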