[Python-ideas] Proposal for default character representation
Cory Benfield
cory at lukasa.co.uk
Thu Oct 13 06:05:33 EDT 2016
> On 13 Oct 2016, at 09:43, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
>
> Mikhail V wrote:
>> Did you see much code written with hex literals?
>
> From /usr/include/sys/fcntl.h:
>
Backing Greg up for a moment, hex literals are extremely common in any code that needs to work with binary data, such as network programming or fine-grained data structure manipulation. For example, consider the frequent requirement to mask out certain bits of a given integer (e.g., keep only the low 24 bits of a 32-bit integer). Here are a few ways to represent that:
integer & 0x00FFFFFF # Hex
integer & 16777215 # Decimal
integer & 0o77777777 # Octal
integer & 0b111111111111111111111111 # Binary
Of those four, hexadecimal has the advantage of being both extremely concise and clear. The octal representation is infuriating because one octal digit refers to *three* bits, which means that there is a non-whole number of octal digits in a byte (that is, one byte with all bits set is represented by 0o377). This causes problems both with reading comprehension and with most other common tasks. For example, moving from 0xFF to 0xFFFF (or 255 to 65535, i.e. setting the next most significant byte to all 1s) is represented in octal by moving from 0o377 to 0o177777. This is not an obvious transition, and I doubt many programmers could do it from memory in any representation but hex or binary.
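To make those equivalences concrete, here is a quick interpreter sketch (plain Python, nothing version-specific assumed):

0o377 == 0xFF == 255            # True: one byte, all bits set
0o177777 == 0xFFFF == 65535     # True: two bytes, all bits set
oct(0xFF), oct(0xFFFF)          # ('0o377', '0o177777'): the octal forms grow in a non-obvious way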
Decimal is no clearer. Programmers know how to represent certain bit patterns from memory in decimal simply because they see them a lot: usually they can do the all-1s case, and often the 0-followed-by-all-1s case (255 and 127 for one byte, 65535 and 32767 for two bytes, and then increasingly few programmers know the next set). But trying to work out what mask to use for setting only bits 15 and 14 is tricky in decimal, while in hex it’s fairly easy (in hex it’s 0xC000, in decimal it’s 49152).
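For that last example, a short sketch of how you might build the mask programmatically (just standard shifts and ors, nothing assumed):

mask = (1 << 15) | (1 << 14)    # set only bits 15 and 14
hex(mask)                       # '0xc000': readable at a glance
mask                            # 49152: nothing about the decimal form says "bits 15 and 14"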
Binary notation seems like the solution, but note the above case: the only way to work out how many bits are being masked out is to count them, and there can be quite a lot. IIRC there’s some new syntax coming for binary literals (PEP 515, underscores in numeric literals) that would let us represent them as 0b1111_1111_1111_1111, which would help the readability case, but it’s still substantially less dense and loses clarity for many kinds of unusual bit patterns. Additionally, as the number of bits increases life gets really hard: masking out certain bits of a 64-bit number requires a literal that’s at least 66 characters long before underscores; adding the 15 grouping underscores brings it to 81 characters (more than the PEP 8 line-width recommendation). That starts getting unwieldy fast, while the hex representation is still down at 18 characters.
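Those character counts are easy to check; a sketch assuming the underscore support from PEP 515 (Python 3.6+), including its format-spec form:

mask = (1 << 64) - 1            # a 64-bit all-ones mask
len(bin(mask))                  # 66: '0b' plus 64 binary digits
len(f"{mask:#_b}")              # 81: with the 15 grouping underscores added
len(hex(mask))                  # 18: '0x' plus 16 hex digits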
Hexadecimal has the clear advantage that each character wholly represents 4 bits, and the next 4 bits are independent of the previous bits. That’s not true of decimal or octal, and while it’s true of binary it costs a fourfold increase in the length of the representation. It’s definitely not as intuitive to the average human being, but that’s ok: it’s a specialised use case, and we aren’t requiring that all human beings learn this skill.
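As a small illustration of that independence (0x12345678 is just an arbitrary example value): each hex digit lines up with one 4-bit field, so you can read or rewrite a single digit without disturbing its neighbours.

value = 0x12345678
(value >> 20) & 0xF             # 3: the digit occupying bits 20-23, read in isolation
hex(value | (0xF << 20))        # '0x12f45678': exactly one digit changes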
This is a very long argument to suggest that your argument against hexadecimal literals (namely, that they use 16 glyphs as opposed to the 10 glyphs used in decimal) is too simple to be correct. Different collections of glyphs are clearer in different contexts. For example, decimal numerals can be represented using 10 glyphs, while the English language requires 26 glyphs plus punctuation. But I don’t think you’re seriously proposing that we should swap from writing English using the larger glyph set to writing it as the decimal representation of its ASCII bytes.
Given this, I think the argument that the Unicode consortium says “write the number in hex” is good enough for me.
Cory