steve at pearwood.info
Tue Aug 10 02:24:03 CEST 2010
On Mon, 9 Aug 2010 11:51:34 pm you wrote:
> Steven D'Aprano wrote:
> > On Mon, 9 Aug 2010 07:23:56 pm Dave Angel wrote:
> >> Big difference between 2.x and 3.x. In 3.x, strings are Unicode,
> >> and may be stored either in 16bit or 32bit form (Windows usually
> >> compiled using the former, and Linux the latter).
> > That's an internal storage that you (generic you) the Python
> > programmer doesn't see, except perhaps indirectly via memory
> > consumption.
> > Do you know how many bits are used to store floats? If you try:
> > <snip>
> You've missed including the context that I was responding to.
Possibly so, but I didn't miss *reading* the context, and it wasn't
clear to me exactly what you were trying to get across to Richard.
Maybe that was just my poor reading comprehension, or maybe the two of
you had gone off on a tangent that was confusing. At least to me.
> well aware of many historical architectures, and have dealt with the
> differences between the coding on an IBM 26 keypunch and an IBM 29.
I only know of these ancient machines second or third hand. In any case,
my mention of non-8-bit bytes was clearly marked as an aside, and not
meant to imply that the number of bits in a byte will vary from one
Python implementation to another. The point of my post was that the
internal storage of Unicode strings is virtually irrelevant to the
Python programmer. Strings are strings, and the way Python stores them
in memory is as irrelevant as the way it stores tuples, or floats, or
long ints, or None.
That is to say, the way they are stored will affect speed and memory
consumption, but as Python programmers, we have very little say in the
matter. We deal with high-level objects. Unless we hack the Python
compiler, or possibly do strange and dangerous things with ctypes, we
don't have any access to the internal format of those objects.
> The OP was talking about the display of \xhh and thought he had
> discovered a discrepancy between the docs on 2.x and 3.x. And for
> that purpose it is quite likely relevant that 3.x has characters that
> won't fit in 8 bits, and thus be describable in two hex digits. I
> was trying to point out that characters in 3.x are more than 16 bits,
> and thus would require more than two hex digits.
The number of bytes used for the in-memory Unicode implementations does
*not* relate to the number of bytes used when encoded to bytes. They're
Unicode strings are sequences of code points, integers between 0 and
10ffff in base-16, or 0 and 1114111 in base 10. The in-memory storage
of those code points is an implementation detail. The two most common
implementations are the 2-byte and 4-byte version, but even there it
will depend on whether your platform is big-endian or little-endian or
Take code point 61 (base-16), or the character 'a'. Does it matter
whether that is stored in memory as a two-byte chunk 0061 or 6100, or a
four-byte chunk 00000061, 00006100, 00610000 or 61000000, or something
else? No. When you print the character 'a', it prints as character 'a'
regardless of what the internal storage looks like. Characters are
characters, and the internal storage doesn't matter.
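
To make that concrete, here is a small sketch (assuming Python 3; the
variable names are mine, purely for illustration) showing that all you
ever see from Python code is the code point, as an integer, and that the
valid range stops at 10ffff:

```python
# Code points are plain integers; ord() and chr() convert both ways.
a = 'a'
code_point = ord(a)          # 0x61 in base 16, 97 in base 10
same_char = chr(0x61)        # back to the character 'a'

# The largest valid code point is 0x10ffff (1114111 in base 10).
last = chr(0x10ffff)
try:
    chr(0x110000)            # one past the end of the Unicode range
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False

print(hex(code_point), same_char == a, ord(last), out_of_range_ok)
```

Nothing in that snippet depends on whether the interpreter stores the
string in two-byte or four-byte chunks.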
We could, if we wanted, write an implementation of Unicode in Python,
where the code points are 16-byte (128-bit) int objects. It would be
horribly slow, but it would still be Unicode, and the character 'a'
would be represented in memory by whatever the PyIntObject C data
structure happens to be. (Whatever it is, it won't be pretty.)
To get bytes, the internal storage of Unicode doesn't matter. You need
to specify an encoding, and the result you get depends on that
encoding, not the internal storage in memory:
>>> s = 'a' + chr(220)
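
As a rough sketch of that point (assuming Python 3; the particular
encodings chosen here are just my examples), the same two-character
string gives a different number of bytes under each encoding:

```python
s = 'a' + chr(220)                   # 'a' followed by U+00DC

# The same string, encoded four different ways.
latin1 = s.encode('latin-1')         # one byte per character
utf8 = s.encode('utf-8')             # U+00DC needs two bytes here
utf16 = s.encode('utf-16-be')        # two bytes per character
utf32 = s.encode('utf-32-be')        # four bytes per character

for name, data in [('latin-1', latin1), ('utf-8', utf8),
                   ('utf-16-be', utf16), ('utf-32-be', utf32)]:
    print(name, len(data), data)

# Decoding with the matching encoding always gives back the same string.
assert utf8.decode('utf-8') == s
```

The byte counts come out as 2, 3, 4 and 8, and none of them tells you
anything about how the interpreter holds s in memory.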
> But a b'' string does not.
Naturally. By definition, each byte in a sequence of bytes is a single
> I don't usually use 3.1, but I was curious to discover that repr()
> won't display a string with an arbitrary Unicode character in it.
repr() doesn't display anything. repr() returns the string
representation, not the byte representation. Try this:
a = chr(300)
b = repr(a)
My prediction is that it will succeed, and not fail. Then try this:
My prediction is that it will fail with UnicodeEncodeError. It is
your terminal that can't display arbitrary Unicode characters, because
your terminal has a weird encoding set. Fix the terminal, and you
won't have the problem:
>>> a = chr(300)
>>> print(a, repr(a))
There's almost never any good reason for using an encoding other than
> I realize that it can't produce a pair of bytes without a (non-ASCII)
No, you have that backwards. Strings encode to bytes. Bytes decode to
> but it doesn't make sense to me that repr() doesn't display
> something reasonable, like hex.
You are confused. repr() doesn't display anything, any more than len()
displays things. repr() returns a string, not bytes. What happens next
depends on what you do with it.
> FWIW, my sys.stdout.encoding is cp437.
Well, there's your problem.
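
Here is a sketch of what I mean (assuming Python 3; I am simulating the
cp437 terminal with an explicit encode() call rather than a real
console, so it should behave the same on any machine):

```python
a = chr(300)                 # U+012C, which cp437 has no byte for
b = repr(a)                  # succeeds: repr() just builds another str
is_text = isinstance(b, str)

# Printing has to encode the text for the terminal. That is the step
# that fails, not repr(). Simulate a cp437 terminal here.
try:
    b.encode('cp437')
    encodable = True
except UnicodeEncodeError:
    encodable = False

# If you want an ASCII-only form, ask for one explicitly.
escaped = ascii(a)                                    # str with a \u escape
replaced = a.encode('cp437', errors='backslashreplace')  # bytes, escaped

print(is_text, encodable, escaped, replaced)
```

ascii() and the backslashreplace error handler both give you the
hex-style escape you were expecting from repr().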