On Sun, Sep 01, 2019 at 12:24:24PM +1000, Chris Angelico wrote:
Older versions of Python had text and bytes be the same things.
Whether a string object is *text* is a semantic question, and independent of what data format you use. 'Hello world!' is text, whether you are using Python 1.5 or Python 3.8. '\x01\x06\x13\0' is not text, whether you are using Python 1.5 or Python 3.8.
That means that, for backward compatibility, they have some common methods. But does that really mean that bytes can be uppercased?
I'm curious what you think that b'chris angelico'.upper() is doing, if it is not uppercasing the byte-string b'chris angelico'. Is it a mere accident that the result happens to be b'CHRIS ANGELICO'?

Unicode strings are sequences of code-points, abstract integers between 0 and 1114111 inclusive. When you uppercase the Unicode string 'chris angelico', you're transforming the sequence of integers:

    U+0063,0068,0072,0069,0073,0020,0061,006e,0067,0065,006c,0069,0063,006f

to this sequence of integers:

    U+0043,0048,0052,0049,0053,0020,0041,004e,0047,0045,004c,0049,0043,004f

If you are prepared to call that "uppercasing", you should be prepared to do the same for the byte-string equivalent.

(For the avoidance of doubt: this is independent of the encoding used to store those code points in memory or on disk. Encodings have nothing to do with this.)

[...]
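To make the comparison concrete, here is a minimal sketch (the variable names are mine, chosen only for illustration) that looks at the integers behind the Unicode string and behind the byte-string, before and after uppercasing:

```python
text = 'chris angelico'
data = b'chris angelico'

# Code points of the str, before and after uppercasing.
text_before = [ord(c) for c in text]
text_after = [ord(c) for c in text.upper()]

# Iterating over a bytes object yields ints directly.
data_before = list(data)
data_after = list(data.upper())

# Both operations map the same integers to the same integers.
print(text_before == data_before)
print(text_after == data_after)
print(data.upper())
```

For pure ASCII input like this, the two sequences of integers agree at every step.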
Or is it that we allow bytes to be treated as ASCII-encoded text, which is then uppercased, and then returned to being bytes?
I'm fairly confident that bytes methods aren't implemented by decoding to Unicode, applying the method, then re-encoding back to bytes. But even if they were, that's just an implementation detail.

Imagine a float method that internally converted the float to a pair of integers (numerator/denominator), operated on that fraction, and then re-converted back to a float. I'm sure you wouldn't want to say that this proves that floats aren't numbers.

The same applies to byte-strings. In the unlikely case that byte methods delegate to str methods, that doesn't mean byte-strings aren't strings. It just means that two sorts of strings can share a single implementation for their methods. Code reuse for the win!

[...]
    py> b"\xe7\x61".upper()
    b'\xe7A'
Whether it is *meaningful* to do so is another question. But the same applies to str.upper: just because you can call the method doesn't mean that the result will be semantically valid.
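As a small illustration of that last point for str, consider a string that holds base64 data. This is only a sketch; the payload b'hello' and the variable names are made up for the example:

```python
import base64

# A str holding base64: perfectly ordinary ASCII characters,
# but the case of each letter is significant.
encoded = base64.b64encode(b'hello').decode('ascii')
shouted = encoded.upper()   # the method call succeeds...

print(encoded, shouted)

# ...but the result no longer decodes to the original payload.
print(base64.b64decode(shouted) == b'hello')
```

Nothing stops you calling .upper() here; it just isn't a meaningful thing to do to this particular string.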
So what did you actually do here? You took some bytes that represent an integer,
For the sake of the argument I'll accept that *this particular* byte string represents an integer rather than a series of mixed binary data and ASCII text, or text in some unusual encoding, or pixels in an image, or any of a million other things it could represent. That's absolutely fine: if it doesn't make sense to call .upper() on your bytes, then don't call .upper() on them. Precisely as you wouldn't call .upper() on a str object, if it didn't make sense to do so.
and you called a method on it that makes no sense whatsoever, because now it represents a different integer.
The same applies to Unicode strings too. Any Unicode string method that transforms the input returns something that represents a different sequence of code-points, hence a different sequence of integers. Shall we agree that neither bytes nor Unicode are strings? No, I don't think so either :-)
If I were to decode that string to text and THEN uppercase it, it might give a quite different result:
Sure. If you perform *any* transformation on the data first, it might give a different result on uppercasing:

- if you reverse the bytes, uppercasing gives a different result;
- if you replace b'a' with b'e', uppercasing gives a different result

etc. And exactly the same observation applies to str objects:

- if you reverse the characters, uppercasing gives a different result;
- if you replace 'a' with 'e', uppercasing gives a different result.
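A minimal sketch of the above, using the two-byte example from earlier in the thread and assuming Latin-1 as the encoding for the decode step:

```python
data = b"\xe7\x61"

# Uppercase the bytes directly: only ASCII letters are affected.
direct = data.upper()

# Transform first (decode as Latin-1), uppercase as text, then encode again.
via_latin1 = data.decode('latin-1').upper().encode('latin-1')

print(direct)
print(via_latin1)

# The same holds for str: transform first and upper() gives a different result.
s = 'spam'
print(s.upper(), s[::-1].upper(), s.replace('a', 'e').upper())
```

Decoding is just one more transformation applied before the uppercasing, no different in kind from reversing or replacing.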
And if you choose some other encoding than Latin-1, you might get different results again.
Sure. The bytes methods like .upper() etc are predicated on the assumption that your bytes represent ASCII text. If your bytes represent something else, then calling the .upper() method may not be meaningful or useful.

In other words... if your bytes string came from an ASCII text file, it's probably safe to uppercase it. If your bytes string came from a JPEG, then uppercasing them will probably make a mess of the image, if not corrupt the file. So don't do that :-)

Analogy: ints support the unary minus operator. But if your int represents a mass, then negating it isn't meaningful. There's no such thing as -5 kg. Should we conclude from this that the int type in Python doesn't represent a number, and that the support of numeric operators and methods is merely for backwards compatibility? I think not.

The formal definition of a string is a sequence of symbols from an alphabet. That is precisely what bytes objects are: the alphabet in this case is the 8-bit numbers 0 to 255 inclusive, which for usefulness, convenience and backwards compatibility can be optionally interpreted as the 7-bit ASCII character set plus another 128 abstract "characters".
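A quick sketch of bytes behaving as a sequence of symbols drawn from that alphabet (the sample value is arbitrary, picked only for the demonstration):

```python
data = b'Hello \xff\x00'

# Indexing yields a symbol from the alphabet: an int in range(256).
print(data[0])
print(all(0 <= sym <= 255 for sym in data))

# The usual sequence operations: length, slicing, concatenation, containment.
print(len(data), data[:5], data + b'!', b'ell' in data)

# Symbols outside the alphabet are rejected.
try:
    bytes([256])
except ValueError as err:
    print('ValueError:', err)
```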
I said they were *strings*. Strings are not necessarily text, although they often are. Formally, a string is a finite sequence of symbols that are chosen from a set called an alphabet. See:
A finite sequence of symbols... you mean like a list of integers within the range [0, 255]? Nothing in that formal definition says that a "string" of anything other than characters should be meaningfully treated as text.
Sure. If your bytes don't represent text, then methods like upper() probably won't do anything meaningful. It's still a string though.
I don't think it's necessary to be too adamant about "must be some sort of thing-we-call-string" here. Let practicality rule, since purity has already waved a white flag at us.
It is because of *practicality* that we should prefer that things that look similar should be similar. Code is read far more often than it is written, and if you read two pieces of code that look similar, we should strongly prefer that they should actually be similar.
And you have yet to prove that this similarity is actually a thing.
I'm not sure the onus is on me to prove this. "Status quo wins a stalemate." And surely the onus is on those proposing the new syntax to demonstrate that it will be fine to use string delimiters as function calls.

You could make a good start by finding other languages, reasonably conventional languages with syntax based on the Algol or C tradition, that use quotes '' or "" to return arbitrary types. Even languages with unconventional syntax like Forth or APL would be a good start.

Maybe I'm wrong. Maybe quotation marks are widely used for purposes other than delimiting strings, and I'm just too ignorant of other languages to know it. Maybe Python is in the minority here.

Anyway, the bottom line is this: I have no objection to using prefixed quotes to represent Unicode strings, or byte strings, or Andrew's hypothetical UTF-16 strings, or EBCDIC strings, or TRON strings.

https://en.wikipedia.org/wiki/TRON_(encoding)

But I think that any API that would allow z"..." to represent (let's say) a socket, or a float, or a HTTP_Server instance, or a list, would be a deeply flawed API.
Let's look at regular expressions. JavaScript has a syntax for them involving leading and trailing slashes, borrowed from Perl, but I can't figure out whether a regex is a first-class object in Perl. So you can do something like this:
    findme = /spa*m/
    "This has spaaaaaam in it".match(findme)
    [ 'spaaaaaam', index: 9, input: 'This has spaaaaaam in it' ]
In Python, I can do the exact same thing, only using double quotes as the delimiter.
re.search("spa*m", "This has spaaaaam in it")
Sure. As a convenience, the re module has functions which accept regular expression patterns as well as compiled regular expression objects.
So what do you mean by "non-string" exactly? In what way is a regular expression "not a string",
That question is ambiguous. Are you asking about regular expression patterns, or regular expression objects?

Regular expression *patterns* are clearly strings:

    pattern = r'...'

We type them with string delimiters, if you call type(pattern) it will return str, you can slice the pattern or uppercase it.

Regular expression *objects* are just as clearly not strings:

    rx = re.compile(pattern)

You can't slice them, they aren't sequences of symbols, they have attributes like rx.flags which have no meaning in a string, they lack string methods like upper, and those methods they have operate very differently from their equivalent string methods:

    pattern.find("X")  # search pattern for "X"
    rx.search("X")     # search "X" for pattern

Regex objects are far more than just the regex pattern.
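The differences can be checked at the prompt. A minimal sketch, reusing the 'spa*m' pattern from the earlier example as a stand-in for the elided one:

```python
import re

pattern = r'spa*m'
rx = re.compile(pattern)

# Patterns are plain strings: slicing and string methods work.
print(type(pattern) is str)
print(pattern[:3], pattern.upper())

# Compiled objects are not strings and have no string methods.
print(isinstance(rx, str))
print(hasattr(rx, 'upper'))
try:
    rx[:3]
except TypeError as err:
    print('TypeError:', err)

# Similar-looking methods run in opposite directions.
print(pattern.find('a'))                              # search the pattern for 'a'
print(rx.search('This has spaaaaam in it').group())   # search the text for the pattern
```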
yet the byte-encoded form of an integer somehow is?
If your bytes represent an integer, then uppercasing them isn't meaningful. If your bytes represent ASCII text then uppercasing them may be meaningful.
It makes absolutely no sense to uppercase an integer, yet you could uppercase a regex (since all regex special characters are non-letters, this will make it match uppercase strings).
In general, you can't uppercase regex patterns without radically changing the meaning of them. Consider r'\d' and r'\D'.
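For instance, a short sketch with a made-up sample text shows the uppercased pattern matching exactly what the original excluded:

```python
import re

pattern = r'\d+'
shouted = pattern.upper()   # r'\D+': now matches runs of NON-digits

text = 'abc123'
print(re.search(pattern, text).group())
print(re.search(shouted, text).group())
```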
Yet when you encode the string as bytes, it gains an upper() method, and when you encode a regex as a compiled regex object, it loses one. Why do you insist that a regex is somehow not a string, but b"\xe7\x61" is?
Because a byte-string matches the definition of strings, while compiled regex objects do not.

-- Steven