[Baypiggies] Handling unwanted Unicode \u2019 characters in XML

Stephen McInerney spmcinerney at hotmail.com
Thu Jul 3 23:08:37 CEST 2008


I'm on Solaris 10. My replies are below, but it would be faster for you to call me.
[To everyone else who sent suggestions like "latin1_to_ascii -- The UNICODE Hammer":
I'm reading them too. I'll send out a rollup when I finally figure out the best approach
for my context.]

> You need to execute all the statements.  I'm having difficulty
> understanding how the unicode literal U+2019 can map to U+00E2 like you say.

You're wrongly assuming that 'â' must mean U+00E2. It's just some
non-7-bit character that the shell objects to and mangles.
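
One possible explanation for the 'â', not confirmed for this terminal: the UTF-8
encoding of U+2019 is the three bytes E2 80 99, and anything that reads those bytes
as Latin-1 will show the lead byte 0xE2 as 'â'. The sketch below is Python 3 (the
session in this thread is Python 2), and the variable names are only illustrative.

```python
# Python 3 sketch; assumes the bytes were written as UTF-8 and read as Latin-1.
right_quote = "\u2019"                    # RIGHT SINGLE QUOTATION MARK
utf8_bytes = right_quote.encode("utf-8")  # three bytes: E2 80 99

# Misreading those bytes as Latin-1 turns the lead byte 0xE2 into U+00E2.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex())                  # e28099
print([hex(ord(c)) for c in misread])    # ['0xe2', '0x80', '0x99']
```

So a visible 'â' need not mean the data ever contained U+00E2; it can be the first
byte of a mis-decoded U+2019.
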
> Execute all these statements with cut-n-paste and give us the results:
>>> a = u'\u2019'
>>> b = u'\u00E2'
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 0: ordinal not in range(256)
>>> print b
>>> print a.encode('utf-8')
>>> print b.encode('utf-8')
>>> ord(a)
>>> ord(b)
>>> unichr(ord(a))
>>> unichr(ord(b))
>>> import sys
>>> sys.maxunicode
>>> sys.byteorder

> It might be something trivial that I'm overlooking...  Also, you
> mentioned an exception when trying to print the literal?  I assume it
> was a UnicodeEncodeError?  I'd like to see what it was, in any case.

Yes, it was the usual culprit that thousands are plagued by:
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2019' in position 53: ordinal not in range(256)
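
Two common ways around this error, sketched in Python 3 rather than the Python 2
used in this thread. The sample text and the names PUNCT_MAP, ascii_safe, latin1_a
and latin1_b are hypothetical; which option fits depends on whether the XML consumer
needs the original character preserved.

```python
# Python 3 sketch; sample text and names are made up for illustration.
text = "It\u2019s broken"

# Option 1: swap the curly apostrophe for a plain ASCII one before encoding.
PUNCT_MAP = {0x2019: "'"}
ascii_safe = text.translate(PUNCT_MAP)
latin1_a = ascii_safe.encode("latin-1")

# Option 2: keep the character, written as an XML numeric character reference.
latin1_b = text.encode("latin-1", errors="xmlcharrefreplace")

print(latin1_a)  # b"It's broken"
print(latin1_b)  # b'It&#8217;s broken'
```

Option 2 is lossless for XML output, since a parser turns &#8217; back into U+2019;
option 1 changes the text but yields pure ASCII.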

