[Python-ideas] Re: Custom string prefixes

1 Sep 2019

On Sun, Sep 1, 2019 at 10:47 AM Steven D'Aprano wrote:
...
On Sat, Aug 31, 2019 at 09:31:15PM +1000, Chris Angelico wrote:
...
On Sat, Aug 31, 2019 at 8:44 PM Steven D'Aprano  wrote:
...
...
So b"abc" should not be allowed?
In what way are byte-STRINGS not strings? Unicode-strings and
byte-strings share a significant fraction of their APIs, and are so
similar that back in Python 2.2 the devs thought it was a good idea to
try automagically coercing from one to the other.
I was careful to write *string* rather than *str*. Sorry if that wasn't
clear enough.
We call it a string, but a bytes object has as much in common with
bytearray and with a list of integers as it does with a text string.
I don't think that's true.
Older versions of Python had text and bytes be the same things. That
means that, for backward compatibility, they have some common methods.
But does that really mean that bytes can be uppercased? Or is it that
we allow bytes to be treated as ASCII-encoded text, which is then
uppercased, and then returned to being bytes?
...
py> b'abc'.upper()
b'ABC'
py> [1, 2, 3].upper()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'upper'
In Python2, byte-strings and Unicode strings were both subclasses of
type basestring. Although we have moved away from that shared base class
in Python3, it does demonstrate that conceptually bytes and str are
closely related to each other.
Or does it actually demonstrate that Python 3 maintains backward
compatibility with Python 2?
...
...
Are the contents of a MIDI file a "string"? I would say no, they're not -
but they can *contain* strings, eg for metadata and lyrics.
You can't upper-case the
variable-length-integer b"\xe7\x61" any more than you can upper-case
the integer 13281.
Of course you can.
py> b"\xe7\x61".upper()
b'\xe7A'
Whether it is *meaningful* to do so is another question. But the same
applies to str.upper: just because you can call the method doesn't mean
that the result will be semantically valid.
So what did you actually do here? You took some bytes that represent
an integer, and you called a method on it that makes no sense
whatsoever, because now it represents a different integer. There's no
sense in which your new bytes object represents an "upper-cased
version of" the integer 13281. If I were to decode that string to text
and THEN uppercase it, it might give a quite different result:
...
...
...
b"\xe7\x61".decode("Latin-1").upper().encode("Latin-1")
b'\xc7A'
And if you choose some other encoding than Latin-1, you might get
different results again. I put it to you that bytes.upper() exists
more for backward compatibility with Python 2 than because a bytes
object is, in some way, uppercaseable.
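To make that concrete, here is a rough sketch that reads the two bytes as a MIDI-style variable-length quantity before and after calling upper(), and then compares upper-casing the raw bytes with upper-casing via a text encoding. The helper name decode_vlq is made up for this illustration, and UTF-8 is simply one example of "some other encoding".

```python
def decode_vlq(data: bytes) -> int:
    """Decode a MIDI-style variable-length quantity (hypothetical helper).

    Each byte contributes its low 7 bits; a set high bit means "more follows".
    """
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            break
    return value


raw = b"\xe7\x61"

before = decode_vlq(raw)          # the integer the bytes were meant to encode
after = decode_vlq(raw.upper())   # bytes.upper() only touches ASCII a-z
print(before, after)              # 13281 13249

# Treating the same bytes as Latin-1 text gives a different "upper-cased" result.
via_latin1 = raw.decode("latin-1").upper().encode("latin-1")
print(raw.upper(), via_latin1)    # b'\xe7A' b'\xc7A'

# Under UTF-8 the bytes are not valid text at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)                    # False
```

The same call gives three different answers depending on which interpretation you layer on top of the bytes, and none of them is an "upper-cased 13281".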
...
source = "def spam():\n\tpass\n"
source = source.upper()  # no longer valid Python source code.
But it started out as text, and it is now uppercase text. When you do
that with bytes, you have to first layer in "this is actually encoded
text", and you are then able to destroy that.
...
...
Bytes and text have a long relationship, and as such, there are
special similarities. That doesn't mean that bytes ARE text,
I didn't say that bytes are (human-readable) text. Although they can be:
not every application needs Unicode strings, ASCII strings are still
special, and there are still applications where one has to mix binary
and ASCII text data.
I said they were *strings*. Strings are not necessarily text, although
they often are. Formally, a string is a finite sequence of symbols that
are chosen from a set called an alphabet. See:
https://en.wikipedia.org/wiki/String_%28computer_science%29
A finite sequence of symbols... you mean like a list of integers
within the range [0, 255]? Nothing in that formal definition says that
a "string" of anything other than characters should be meaningfully
treated as text.
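A small sketch of what that looks like in Python 3, assuming nothing beyond the built-in types: a bytes object indexes and round-trips like a sequence of small integers, whereas a str indexes to one-character strings.

```python
data = b"abc"

# Indexing bytes yields integers; indexing str yields one-character strings.
first_byte = data[0]
first_char = "abc"[0]
print(first_byte, repr(first_char))   # 97 'a'

# bytes converts losslessly to and from a list of integers in range(256).
as_list = list(data)
round_trip = bytes(as_list)
print(as_list, round_trip)            # [97, 98, 99] b'abc'

# bytearray is the mutable counterpart, and item assignment takes an integer.
mutable = bytearray(data)
mutable[0] = 65
print(bytes(mutable))                 # b'Abc'

# Integers outside range(256) are rejected, as the formal "alphabet" requires.
try:
    bytes([256])
    rejected = False
except ValueError:
    rejected = True
print(rejected)                       # True
```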
...
...
I don't think it's necessary to be too adamant about "must be some
sort of thing-we-call-string" here. Let practicality rule, since
purity has already waved a white flag at us.
It is because of *practicality* that we should prefer that things that
look similar should be similar. Code is read far more often than it is
written, and if two pieces of code look similar, we should
strongly prefer that they actually be similar.
And you have yet to prove that this similarity is actually a thing.
...
Would you be happy with a Pythonesque language that used prefixed
strings as the delimiter for arbitrary data types?
mylist = L"1, 2, None, {}, L"", 99.5"
mydict = D"key: value, None: L"", "abc": "xyz""
myset = S"1, 2, None"
At some point it's meaningless to call it a "Pythonesque" language,
but I've worked with plenty of languages that simply do not have data
types this rich, and so everything is manipulated the exact same way.
When a list of values is represented as ";item 1;item 2;item 3"
(actually as a string), or when you unpack a URL to find that it has
JSON embedded inside it, the idea of a "prefixed string" that tells
you exactly what data type is coming would be a luxury.
...
That's what this proposal wants: string syntax that can return arbitrary
data types.
How about using quotes for function calls?
assert chr"9" == "\t"
assert ord"9" == 57
That's what this proposal wants: string syntax for a subset of function
calls.
Don't say that this proposal won't be abused. Every one of the OP's
motivating examples is an abuse of the syntax, returning non-strings
from something that looks like a string.
Let's look at regular expressions. JavaScript has a syntax for them
involving leading and trailing slashes, borrowed from Perl (though I
can't figure out whether a regex is a first-class object in Perl). In
JavaScript you can do something like this:
...
findme = /spa*m/
"This has spaaaaaam in it".match(findme)
[ 'spaaaaaam', index: 9, input: 'This has spaaaaaam in it' ]
In Python, I can do the exact same thing, only using double quotes as
the delimiter.
...
...
...
import re
re.search("spa*m", "This has spaaaaam in it")

So what do you mean by "non-string" exactly? In what way is a regular
expression "not a string", yet the byte-encoded form of an integer
somehow is? It makes absolutely no sense to uppercase an integer, yet
you could uppercase a regex (since all regex special characters are
non-letters, this will make it match uppercase strings). Yet when you
encode the string as bytes, it gains an upper() method, and when you
encode a regex as a compiled regex object, it loses one. Why do you
insist that a regex is somehow not a string, but b"\xe7\x61" is?
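As a quick illustration of that asymmetry (a sketch using only the standard re module, with "spa*m" as the sample pattern from above):

```python
import re

pattern_text = "spa*m"
compiled = re.compile(pattern_text)

# The regex as text has upper(); the compiled regex object does not;
# the byte-encoded integer does.
text_has_upper = hasattr(pattern_text, "upper")
compiled_has_upper = hasattr(compiled, "upper")
bytes_has_upper = hasattr(b"\xe7\x61", "upper")
print(text_has_upper, compiled_has_upper, bytes_has_upper)  # True False True

# Upper-casing this particular pattern yields one that matches upper-case text.
shouted = re.search(pattern_text.upper(), "THIS HAS SPAAAAAM IN IT")
print(shouted is not None)  # True
```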

ChrisA