Martin v. Löwis
martin at v.loewis.de
Wed Nov 5 20:27:59 CET 2003
"Ezequiel, Justin" <j.ezequiel at spitech.com> writes:
> I am converting XML files with entities to utf-8 using a lookup table:
> &OverBrace; 0FE37
> &UnderBrace; 0FE38
> <sc>O</sc> 1D4AA
The last one is not an XML entity reference, of course. Also, you are
not converting to UTF-8, at least not in this table; you convert to
Unicode code points.
> I have no idea what I am doing but I sure think that I absolutely
> need it.
If you eventually need UTF-8, you might just as well create a mapping
table that translates to UTF-8.
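
A minimal sketch of such a table, assuming a present-day Python 3
interpreter (which postdates this message). The names CODE_POINTS,
UTF8_TABLE and to_utf8 are made up for illustration, and the entity
spellings &OverBrace; and &UnderBrace; are assumed for the first two
entries of the quoted table:

```python
# Hypothetical lookup table: source text -> Unicode code point (hex digits),
# mirroring the three entries quoted above.
CODE_POINTS = {
    "&OverBrace;": "0FE37",
    "&UnderBrace;": "0FE38",
    "<sc>O</sc>": "1D4AA",
}

# Derive a second table that maps directly to UTF-8 byte strings.
UTF8_TABLE = {
    key: chr(int(hex_cp, 16)).encode("utf-8")
    for key, hex_cp in CODE_POINTS.items()
}


def to_utf8(text):
    """Encode text as UTF-8, replacing every key found in UTF8_TABLE."""
    out = text.encode("utf-8")
    for key, replacement in UTF8_TABLE.items():
        out = out.replace(key.encode("utf-8"), replacement)
    return out


if __name__ == "__main__":
    print(to_utf8("x <sc>O</sc> y"))
```

A real converter would probably parse the XML rather than do plain
substring replacement; the point here is only that the table can hold
the UTF-8 bytes directly.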
> Can you explain more on non-BMP characters (and Python's
> capabilities to represent these) and how it applies (if it does) to
> my needs?
Well, the BMP (basic multilingual plane) is the first 65536 code points
of Unicode (U+0000 to U+FFFF). Recent Unicode revisions allocated
rarely used characters beyond the first 64k; the MathML characters were
allocated there as well.
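
To see which entries of a lookup table fall outside the BMP, a check
along these lines should do. It is only a sketch; is_bmp and BMP_LIMIT
are illustrative names, and the code points are the three from the
quoted table:

```python
BMP_LIMIT = 0x10000  # the BMP holds code points U+0000 through U+FFFF


def is_bmp(code_point):
    """Return True if code_point lies in the basic multilingual plane."""
    return 0 <= code_point < BMP_LIMIT


if __name__ == "__main__":
    for hex_cp in ("0FE37", "0FE38", "1D4AA"):
        cp = int(hex_cp, 16)
        print(hex_cp, "BMP" if is_bmp(cp) else "non-BMP")
```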
Python traditionally used a two-byte type to represent Unicode,
so it cannot represent characters outside the BMP, at least not in
Unicode strings of length 1. If you compile Python with --enable-unicode=ucs4,
you can readily represent all these characters. If you have only
UCS-2, you need two-character surrogate pairs to represent non-BMP
characters; this is called UTF-16.
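
The arithmetic behind a surrogate pair can be sketched as follows. The
function name to_surrogate_pair is made up for this example; the
constants are the ones UTF-16 defines:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % code_point)
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D4AA)
    print("%04X %04X" % (high, low))
```

For U+1D4AA this gives the pair D835 DCAA, which is what a two-byte
build stores as a Unicode string of length 2.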
If you want to learn more about UTF-16, see
Python supports UTF-16 in the following contexts:
- encoding and decoding surrogate pairs in the UTF-8 codec
- representing a non-BMP character written as a single \U escape in a
  Unicode string literal as a surrogate pair
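
A rough illustration of code points vs. code units, assuming Python 3.3
or later, where len() on a string counts code points regardless of how
the interpreter was built. The UTF-16 code units are counted here by
encoding explicitly, which is not how a two-byte build would report
them; the variable names are illustrative only:

```python
s = "\U0001D4AA"  # a single \U escape naming a non-BMP character

code_points = len(s)                           # length in code points
utf16_units = len(s.encode("utf-16-le")) // 2  # length in UTF-16 code units
utf8_bytes = s.encode("utf-8")                 # the UTF-8 codec emits 4 bytes

if __name__ == "__main__":
    print(code_points, utf16_units, len(utf8_bytes))
```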
Other aspects of UTF-16, such as distinguishing between the length of
a string in code points vs. the length of the string in code units are