On Sun, Sep 1, 2019 at 10:47 AM Steven D'Aprano wrote:
On Sat, Aug 31, 2019 at 09:31:15PM +1000, Chris Angelico wrote:
On Sat, Aug 31, 2019 at 8:44 PM Steven D'Aprano wrote:
So b"abc" should not be allowed?
In what way are byte-STRINGS not strings? Unicode-strings and byte-strings share a significant fraction of their APIs, and are so similar that back in Python 2.2 the devs thought it was a good idea to try automagically coercing from one to the other.
I was careful to write *string* rather than *str*. Sorry if that wasn't clear enough.
We call it a string, but a bytes object has as much in common with bytearray and with a list of integers as it does with a text string.
I don't think that's true.
Older versions of Python had text and bytes be the same things. That means that, for backward compatibility, they have some common methods. But does that really mean that bytes can be uppercased? Or is it that we allow bytes to be treated as ASCII-encoded text, which is then uppercased, and then returned to being bytes?
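As a minimal sketch of the "ASCII-encoded text" reading (assuming CPython 3; the helper name `ascii_upper` is made up for illustration and is not part of Python), bytes.upper() can be reproduced by shifting only the byte values that are ASCII lowercase letters and leaving every other value alone:

```python
# Hypothetical helper, for illustration only: uppercase a bytes object by
# hand, changing only the values 0x61..0x7A (ASCII 'a'..'z').
def ascii_upper(data: bytes) -> bytes:
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)

every_byte = bytes(range(256))

# The hand-written rule agrees with bytes.upper() for every byte value.
print(ascii_upper(every_byte) == every_byte.upper())

# For pure-ASCII data it also agrees with decode, uppercase, re-encode.
print(b"abc".upper() == b"abc".decode("ascii").upper().encode("ascii"))
```

Both comparisons should print True, which is consistent with the method working on ASCII letters only, whatever the bytes actually represent.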
py> b'abc'.upper()
b'ABC'
py> [1, 2, 3].upper()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'upper'
In Python2, byte-strings and Unicode strings were both subclasses of type basestring. Although we have moved away from that shared base class in Python3, it does demonstrate that conceptually bytes and str are closely related to each other.
Or does it actually demonstrate that Python 3 maintains backward compatibility with Python 2?
Are the contents of a MIDI file a "string"? I would say no, they're not - but they can *contain* strings, eg for metadata and lyrics. You can't upper-case the variable-length-integer b"\xe7\x61" any more than you can upper-case the integer 13281.
Of course you can.
py> b"\xe7\x61".upper()
b'\xe7A'
Whether it is *meaningful* to do so is another question. But the same applies to str.upper: just because you can call the method doesn't mean that the result will be semantically valid.
So what did you actually do here? You took some bytes that represent an integer, and you called a method on it that makes no sense whatsoever, because now it represents a different integer. There's no sense in which your new bytes object represents an "upper-cased version of" the integer 13281. If I were to decode that string to text and THEN uppercase it, it might give a quite different result:
>>> b"\xe7\x61".decode("Latin-1").upper().encode("Latin-1")
b'\xc7A'
And if you choose some other encoding than Latin-1, you might get different results again. I put it to you that bytes.upper() exists more for backward compatibility with Python 2 than because a bytes object is, in some way, uppercaseable.
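To make the encoding dependence concrete, here is a small sketch (assuming CPython 3; `upper_via` is a hypothetical helper name, and UTF-16-LE and UTF-8 are just two codecs picked for contrast) that uppercases the same two bytes under different assumptions about what they encode:

```python
def upper_via(raw: bytes, encoding: str) -> bytes:
    """Decode, uppercase as text, then re-encode with the same codec."""
    return raw.decode(encoding).upper().encode(encoding)

raw = b"\xe7\x61"

print(raw.upper())                   # ASCII-only rule: b'\xe7A'
print(upper_via(raw, "latin-1"))     # two characters: b'\xc7A'
print(upper_via(raw, "utf-16-le"))   # one caseless CJK character: b'\xe7a'

try:
    upper_via(raw, "utf-8")
except UnicodeDecodeError:
    # 0xE7 opens a three-byte sequence, and 0x61 cannot continue it.
    print("not valid UTF-8 at all")
```

Three readings of the same bytes give three different "uppercased" results, and a fourth reading rejects the input outright.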
source = "def spam():\n\tpass\n"
source = source.upper() # no longer valid Python source code.
But it started out as text, and it is now uppercase text. When you do that with bytes, you have to first layer in "this is actually encoded text", and you are then able to destroy that.
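The quoted example can be checked mechanically. This is only a sketch, assuming CPython 3; `is_valid_python` is a hypothetical helper built on the built-in compile():

```python
def is_valid_python(text: str) -> bool:
    """Return True if the text compiles as a Python module."""
    try:
        compile(text, "<example>", "exec")
    except SyntaxError:
        return False
    return True

source = "def spam():\n\tpass\n"
shouted = source.upper()

print(is_valid_python(source))    # True
print(is_valid_python(shouted))   # False: DEF and PASS are not keywords
print(shouted.isupper())          # True: it is still uppercase text
```

The uppercased value stops being Python source, but it remains a str that str.upper() has handled meaningfully as text.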
Bytes and text have a long relationship, and as such, there are special similarities. That doesn't mean that bytes ARE text,
I didn't say that bytes are (human-readable) text. Although they can be: not every application needs Unicode strings, ASCII strings are still special, and there are still applications where one has to mix binary and ASCII text data.
I said they were *strings*. Strings are not necessarily text, although they often are. Formally, a string is a finite sequence of symbols that are chosen from a set called an alphabet. See:
A finite sequence of symbols... you mean like a list of integers within the range [0, 255]? Nothing in that formal definition says that a "string" of anything other than characters should be meaningfully treated as text.
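A short sketch of what "a list of integers within the range [0, 255]" looks like in practice, assuming CPython 3 and using only built-in types:

```python
data = b"abc"

print(data[0])                       # 97: indexing yields an int
print(list(data))                    # [97, 98, 99]
print(bytes([97, 98, 99]) == data)   # True: built from a list of ints
print(bytearray(data) == data)       # True: compares equal to bytearray
print("abc"[0])                      # a: indexing a str yields a str

try:
    bytes([256])
except ValueError:
    # Values outside range(256) are rejected.
    print("256 is not a valid byte value")
```

On this evidence a bytes object round-trips through a list of small integers without any notion of text being involved.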
I don't think it's necessary to be too adamant about "must be some sort of thing-we-call-string" here. Let practicality rule, since purity has already waved a white flag at us.
It is because of *practicality* that we should prefer that things that look similar should be similar. Code is read far more often than it is written, and if two pieces of code look similar, we should strongly prefer that they actually be similar.
And you have yet to prove that this similarity is actually a thing.
Would you be happy with a Pythonesque language that used prefixed strings as the delimiter for arbitrary data types?
mylist = L"1, 2, None, {}, L"", 99.5"
mydict = D"key: value, None: L"", "abc": "xyz""
myset = S"1, 2, None"
At some point it's meaningless to call it a "Pythonesque" language, but I've worked with plenty of languages that simply do not have data types this rich, and so everything is manipulated the exact same way. When a list of values is represented as ";item 1;item 2;item 3" (actually as a string), or when you unpack a URL to find that it has JSON embedded inside it, the idea of a "prefixed string" that tells you exactly what data type is coming would be a luxury.
That's what this proposal wants: string syntax that can return arbitrary data types.
How about using quotes for function calls?
assert chr"9" == "\t"
assert ord"9" == 57
That's what this proposal wants: string syntax for a subset of function calls.
Don't say that this proposal won't be abused. Every one of the OP's motivating examples is an abuse of the syntax, returning non-strings from something that looks like a string.
Let's look at regular expressions. JavaScript has a syntax for them involving leading and trailing slashes, borrowed from Perl, but I can't figure out whether a regex is a first-class object in Perl. So you can do something like this:
> findme = /spa*m/
> "This has spaaaaaam in it".match(findme)
[ 'spaaaaaam', index: 9, input: 'This has spaaaaaam in it' ]
In Python, I can do the exact same thing, only using double quotes as the delimiter.
re.search("spa*m", "This has spaaaaam in it")
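A runnable version of the call above, as a sketch assuming CPython 3 and the standard re module. It also checks two things about the pattern itself: it is an ordinary str (so it has upper()) until it is compiled, and the compiled object has no such method.

```python
import re

pattern = "spa*m"
text = "This has spaaaaam in it"

match = re.search(pattern, text)
print(match.group(), match.start())   # the matched word and its index

# The pattern contains no special characters that are letters, so the
# uppercased pattern matches the uppercased text.
upper_match = re.search(pattern.upper(), text.upper())
print(upper_match.group())

compiled = re.compile(pattern)
print(hasattr(pattern, "upper"))      # True: the pattern is a plain str
print(hasattr(compiled, "upper"))     # False: compiled patterns lack it
```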
So what do you mean by "non-string" exactly? In what way is a regular expression "not a string", yet the byte-encoded form of an integer somehow is? It makes absolutely no sense to uppercase an integer, yet you could uppercase a regex (since all regex special characters are non-letters, this will make it match uppercase strings). Yet when you encode the string as bytes, it gains an upper() method, and when you encode a regex as a compiled regex object, it loses one. Why do you insist that a regex is somehow not a string, but b"\xe7\x61" is?

ChrisA