A few questions about encoding
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Wed Jun 12 21:40:44 EDT 2013
On Wed, 12 Jun 2013 21:30:23 +0100, Nobody wrote:
> The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a
> total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than
> 20 bits).
Same with UTF-8 and UTF-32, both of which are limited to U+10FFFF because
that is what Unicode is limited to.
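To make that limit concrete, here is a minimal sketch, assuming Python 3. The names MAX_CODE_POINT, encoded_lengths and beyond_limit_error are mine, purely for illustration. It encodes the highest code point in all three UTFs and then tries to go one past it:

```python
# Highest code point that Unicode defines.
MAX_CODE_POINT = 0x10FFFF

top = chr(MAX_CODE_POINT)

# All three UTFs can represent the top code point; each happens to need
# 4 bytes for it (the -le variants are used so that no BOM is added).
encoded_lengths = {
    name: len(top.encode(name))
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}
print(encoded_lengths)

# One past the limit is not a code point at all, so there is nothing
# for any encoding to encode: chr() itself refuses.
beyond_limit_error = None
try:
    chr(MAX_CODE_POINT + 1)
except ValueError as err:
    beyond_limit_error = str(err)
print(beyond_limit_error)
```

The failure happens before any codec is involved, which is the point: the ceiling belongs to Unicode, not to a particular encoding.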
The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
that's not UTF-8, that's UTF-8-plus-extra-codepoints. Likewise the
mechanism of UTF-32 could go up to 0xFFFFFFFF, but doing so means you
don't have Unicode chars any more, and hence your byte-string is not
valid UTF-32:
py> b = b'\xFF'*8
py> b.decode('UTF-32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
codepoint not in range(0x110000)
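The same distinction can be shown for UTF-8. What follows is only a sketch, assuming Python 3: decode_extended is a hypothetical helper of my own, not anything in the standard library, and it implements just the bit layout of the extended mechanism (no checks for overlong forms or surrogates). It decodes a 6-byte sequence that the real UTF-8 codec refuses:

```python
def decode_extended(seq: bytes) -> int:
    """Decode ONE sequence using the extended (up to 6-byte) bit layout.

    Hypothetical helper for illustration only: it does not reject
    overlong encodings or surrogates, so it is not a UTF-8 decoder.
    """
    lead = seq[0]
    if lead < 0x80:
        return lead  # plain ASCII byte

    # The number of leading 1 bits in the lead byte is the sequence length.
    length = 0
    mask = 0x80
    while lead & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6 or len(seq) != length:
        raise ValueError("malformed sequence")

    value = lead & (mask - 1)  # payload bits left over in the lead byte
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        value = (value << 6) | (byte & 0x3F)  # 6 payload bits per byte
    return value


six_bytes = b"\xfd\xbf\xbf\xbf\xbf\xbf"

# The mechanism happily yields a 31-bit number...
value = decode_extended(six_bytes)
print(hex(value))

# ...but the real codec rejects the same bytes, because the result
# would not be a Unicode code point.
try:
    six_bytes.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
print(strict_rejects)
```

For sequences that are valid UTF-8 the helper and the codec agree; they part company only once the lead byte promises more than 4 bytes.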
--
Steven