[Tutor] os.urandom()

Steven D'Aprano steve at pearwood.info
Tue Aug 10 02:24:03 CEST 2010


On Mon, 9 Aug 2010 11:51:34 pm you wrote:
> Steven D'Aprano wrote:
> > On Mon, 9 Aug 2010 07:23:56 pm Dave Angel wrote:
> >> Big difference between 2.x and 3.x.  In 3.x, strings are Unicode,
> >> and may be stored either in 16bit or 32bit form (Windows usually
> >> compiled using the former, and Linux the latter).
> >
> > That's an internal storage that you (generic you) the Python
> > programmer doesn't see, except perhaps indirectly via memory
> > consumption.
> >
> > Do you know how many bits are used to store floats? If you try:
> > <snip>
>
> You've missed including the context that I was responding to.

Possibly so, but I didn't miss *reading* the context, and it wasn't 
clear to me exactly what you were trying to get across to Richard. 
Maybe that was just my poor reading comprehension, or maybe the two of 
you had gone off on a tangent that was confusing, at least to me.


> I'm 
> well aware of many historical architectures, and have dealt with the
> differences between the coding on an IBM 26 keypunch and an IBM 29.  

I only know of these ancient machines second or third hand. In any case, 
my mention of non-8-bit bytes was clearly marked as an aside, and not 
meant to imply that the number of bits in a byte will vary from one 
Python implementation to another. The point of my post was that the 
internal storage of Unicode strings is virtually irrelevant to the 
Python programmer. Strings are strings, and the way Python stores them 
in memory is as irrelevant as the way it stores tuples, or floats, or 
long ints, or None.

That is to say, the way they are stored will affect speed and memory 
consumption, but as Python programmers, we have very little say in the 
matter. We deal with high-level objects. Unless we hack the Python 
compiler, or possibly do strange and dangerous things with ctypes, we 
don't have any access to the internal format of those objects.
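
About the only place the difference leaks through is memory 
consumption, and even that is implementation-dependent. A quick sketch 
(the exact numbers vary by Python version and build, so I won't 
predict them):

import sys

# The internal representation only shows up indirectly, as memory use:
print(sys.getsizeof('a'))        # some number of bytes; build-dependent
print(sys.getsizeof('a' * 100))  # larger, but by how much depends on
                                 # the build (2-byte vs 4-byte storage)

# The *behaviour* of the string is identical everywhere:
print(len('a' * 100))            # always 100 characters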


[...]
> The OP was talking about the display of \xhh  and thought he had
> discovered a discrepancy between the docs on 2.x and 3.x.  And for
> that purpose it is quite likely relevant that 3.x has characters that
> won't fit in 8 bits, and thus be describable in two hex digits.  I
> was trying to point out that characters in 3.x are more than 16 bits,
> and thus would require more than two hex digits.

The number of bytes used for the in-memory Unicode implementation does 
*not* relate to the number of bytes you get when encoding to bytes. 
They're independent.

Unicode strings are sequences of code points, integers between 0 and 
10FFFF in base 16, or 0 and 1114111 in base 10. The in-memory storage 
of those code points is an implementation detail. The two most common 
implementations are the 2-byte and 4-byte versions, but even there the 
exact layout will depend on whether your platform is big-endian or 
little-endian or something else.
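
You can poke at the range yourself (output from a recent CPython; the 
exact traceback formatting may differ in older versions):

>>> ord('a')             # the code point of 'a', in decimal
97
>>> hex(ord('a'))
'0x61'
>>> chr(0x110000)        # one past the last valid code point
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)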

Take code point 61 (base-16), or the character 'a'. Does it matter 
whether that is stored in memory as a two-byte chunk 0061 or 6100, or a 
four-byte chunk 00000061, 00006100, 00610000 or 61000000, or something 
else? No. When you print the character 'a', it prints as character 'a' 
regardless of what the internal storage looks like. Characters are 
characters, and the internal storage doesn't matter.
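
If you care about byte order, you ask for it explicitly when you 
encode; the string itself doesn't have one:

>>> 'a'.encode('utf-16-le')   # little-endian: low byte first
b'a\x00'
>>> 'a'.encode('utf-16-be')   # big-endian: high byte first
b'\x00a'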

We could, if we wanted, write an implementation of Unicode in Python, 
where the code points are 16-byte (128-bit) int objects. It would be 
horribly slow, but it would still be Unicode, and the character 'a' 
would be represented in memory by whatever the PyIntObject C data 
structure happens to be. (Whatever it is, it won't be pretty.)
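
Here's a toy sketch of that idea (nothing like CPython's actual 
implementation, just to show that the storage format is a free choice):

# A "string" stored as a bare tuple of code point ints.
def to_codepoints(s):
    return tuple(ord(c) for c in s)

def from_codepoints(cps):
    return ''.join(chr(cp) for cp in cps)

cps = to_codepoints('aÜ')
print(cps)                    # (97, 220)
print(from_codepoints(cps))   # aÜ -- same string, exotic storage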

To get bytes, the internal storage of Unicode doesn't matter. You need 
to specify an encoding, and the result you get depends on that 
encoding, not the internal storage in memory:


>>> s = 'a' + chr(220)
>>> print(s)
aÜ
>>> s.encode('latin-1')
b'a\xdc'
>>> s.encode('utf-8')
b'a\xc3\x9c'
>>> s.encode('utf-16')
b'\xff\xfea\x00\xdc\x00'


> But a b'' string does not.

Naturally. By definition, each byte in a sequence of bytes is a single 
byte.
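
You can see that directly: in Python 3, indexing or iterating over a 
bytes object gives you plain ints in range(256):

>>> data = b'a\xdc'
>>> data[0], data[1]
(97, 220)
>>> list(data)
[97, 220]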


> I don't usually use 3.1, but I was curious to discover that repr()
> won't display a string with an arbitrary Unicode character in it.

repr() doesn't display anything. repr() returns the string 
representation, not the byte representation. Try this:

a = chr(300)
b = repr(a)

My prediction is that it will succeed, and not fail. Then try this:

print(a)

My prediction is that it will fail with UnicodeEncodeError. It is 
your terminal that can't display arbitrary Unicode characters, because 
your terminal has a weird encoding set. Fix the terminal, and you 
won't have the problem:

>>> import sys
>>> a = chr(300)
>>> print(a, repr(a))
Ĭ 'Ĭ'
>>> sys.stdout.encoding
'UTF-8'

There's almost never any good reason for using an encoding other than 
utf-8.


> I realize that it can't produce a pair of bytes without a (non-ASCII)
> decoding, 

No, you have that backwards. Strings encode to bytes. Bytes decode to 
strings.
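
The round trip, using the same string as before:

>>> 'aÜ'.encode('utf-8')          # string -> bytes
b'a\xc3\x9c'
>>> b'a\xc3\x9c'.decode('utf-8')  # bytes -> string
'aÜ'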

> but it doesn't make sense to me that repr() doesn't display 
> something reasonable, like hex.  

You are confused. repr() doesn't display anything, any more than len() 
displays things. repr() returns a string, not bytes. What happens next 
depends on what you do with it.
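
If what you actually want is an escaped, pure-ASCII form, Python 3 
already has a builtin for that: ascii() is like repr() but escapes 
every non-ASCII character:

>>> a = chr(300)
>>> print(ascii(a))   # safe on any terminal
'\u012c'
>>> print(repr(a))    # needs a terminal that can display the character
'Ĭ'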


> FWIW, my sys.stdout.encoding is cp437.

Well, there's your problem.
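
If you truly can't change the terminal, you can at least tell Python 
to degrade gracefully instead of raising. One way (a sketch; cp437 
here just matches your setup) is to rewrap stdout with a forgiving 
error handler, or set PYTHONIOENCODING=cp437:backslashreplace before 
starting Python:

import io
import sys

# Replace unencodable characters with backslash escapes instead of
# letting print() raise UnicodeEncodeError.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='cp437',
                              errors='backslashreplace')
print(chr(300))   # prints \u012c rather than crashing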



-- 
Steven D'Aprano

