[Python-ideas] Re: Custom string prefixes

2 Sep 2019

      On Sun, Sep 01, 2019 at 12:24:24PM +1000, Chris Angelico wrote:
...
Older versions of Python had text and bytes be the same things.
Whether a string object is *text* is a semantic question, and 
independent of what data format you use. 'Hello world!' is text, whether 
you are using Python 1.5 or Python 3.8. '\x01\x06\x13\0' is not text, 
whether you are using Python 1.5 or Python 3.8.
...
That
means that, for backward compatibility, they have some common methods.
But does that really mean that bytes can be uppercased?
I'm curious what you think that b'chris angelico'.upper() is doing, if 
it is not uppercasing the byte-string b'chris angelico'. Is it a mere 
accident that the result happens to be b'CHRIS ANGELICO'?

Unicode strings are sequences of code-points, abstract integers between 
0 and 1114111 inclusive. When you uppercase the Unicode string 'chris 
angelico', you're transforming the sequence of integers:

U+0063,0068,0072,0069,0073,0020,0061,006e,0067,0065,006c,0069,0063,006f

to this sequence of integers:

U+0043,0048,0052,0049,0053,0020,0041,004e,0047,0045,004c,0049,0043,004f

If you are prepared to call that "uppercasing", you should be prepared 
to do the same for the byte-string equivalent.

(For the avoidance of doubt: this is independent of the encoding used to 
store those code points in memory or on disk. Encodings have nothing to 
do with this.)
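
A minimal sketch of the comparison above, using only built-ins; the variable names are illustrative. It checks that the str and the byte-string carry the same sequence of integers both before and after uppercasing.

```python
text = 'chris angelico'
data = b'chris angelico'

# Code points of the str, before and after uppercasing.
str_before = [ord(c) for c in text]
str_after = [ord(c) for c in text.upper()]

# Iterating over a bytes object yields integers directly.
bytes_before = list(data)
bytes_after = list(data.upper())

print(str_before == bytes_before)   # True
print(str_after == bytes_after)     # True
```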

[...]
...
Or is it that
we allow bytes to be treated as ASCII-encoded text, which is then
uppercased, and then returned to being bytes?
I'm fairly confident that bytes methods aren't implemented by decoding 
to Unicode, applying the method, then re-encoding back to bytes. But 
even if they were, that's just an implementation detail.

Imagine a float method that internally converted the float to a pair of 
integers (numerator/denominator), operated on that fraction, and then 
re-converted back to a float. I'm sure you wouldn't want to say that 
this proves that floats aren't numbers.

The same applies to byte-strings. In the unlikely case that byte methods 
delegate to str methods, that doesn't mean byte-strings aren't strings. 
It just means that two sorts of strings can share a single 
implementation for their methods. Code reuse for the win!

[...]
...
...
py> b"\xe7\x61".upper()
b'\xe7A'
Whether it is *meaningful* to do so is another question. But the same
applies to str.upper: just because you can call the method doesn't mean
that the result will be semantically valid.
So what did you actually do here? You took some bytes that represent
an integer,
For the sake of the argument I'll accept that *this particular* byte 
string represents an integer rather than a series of mixed binary data 
and ASCII text, or text in some unusual encoding, or pixels in an image, 
or any of a million other things it could represent.

That's absolutely fine: if it doesn't make sense to call .upper() on 
your bytes, then don't call .upper() on them. Precisely as you wouldn't 
call .upper() on a str object, if it didn't make sense to do so.
...
and you called a method on it that makes no sense
whatsoever, because now it represents a different integer.
The same applies to Unicode strings too. Any Unicode string method that 
transforms the input returns something that represents a different 
sequence of code-points, hence a different sequence of integers.

Shall we agree that neither bytes nor Unicode are strings? No, I don't 
think so either :-)
...
If I were to decode that string to text
and THEN uppercase it, it might give a quite different result:
Sure. If you perform *any* transformation on the data first, it might 
give a different result on uppercasing:

- if you reverse the bytes, uppercasing gives a different result;

- if you replace b'a' with b'e', uppercasing gives a different result;

etc. And exactly the same observation applies to str objects:

- if you reverse the characters, uppercasing gives a different result;

- if you replace 'a' with 'e', uppercasing gives a different result.
...
And if you choose some other encoding than Latin-1, you might get
different results again.
Sure. The bytes methods like .upper() etc are predicated on the 
assumption that your bytes represent ASCII text. If your bytes represent 
something else, then calling the .upper() method may not be meaningful 
or useful.

In other words... if your bytes string came from an ASCII text file, 
it's probably safe to uppercase it. If your bytes string came from a 
JPEG, then uppercasing them will probably make a mess of the image, if 
not corrupt the file. So don't do that :-)
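
To make the ASCII assumption concrete, here is a small sketch comparing bytes.upper() with a decode/upper/encode round trip. Latin-1 is only an assumed encoding for the sake of the example; the variable names are illustrative.

```python
raw = b"\xe7\x61"

# bytes.upper() changes only the ASCII letters a-z; 0xE7 is left alone.
upper_as_bytes = raw.upper()

# Decoding first (Latin-1 is an assumption here) lets str.upper() treat
# U+00E7 as a letter, so it is uppercased as well.
upper_via_latin1 = raw.decode("latin-1").upper().encode("latin-1")

print(upper_as_bytes)     # b'\xe7A'
print(upper_via_latin1)   # b'\xc7A'

# Every byte outside the ASCII range passes through bytes.upper() untouched.
high = bytes(range(128, 256))
print(high.upper() == high)   # True
```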

Analogy: ints support the unary minus operator. But if your int 
represents a mass, then negating it isn't meaningful. There's no such 
thing as -5 kg. Should we conclude from this that the int type in Python 
doesn't represent a number, and that the support of numeric operators 
and methods is merely for backwards compatibility? I think not.

The formal definition of a string is a sequence of symbols from an 
alphabet. That is precisely what bytes objects are: the alphabet in this 
case is the 8-bit numbers 0 to 255 inclusive, which for usefulness, 
convenience and backwards compatibility can be optionally interpreted as 
the 7-bit ASCII character set plus another 128 abstract "characters".
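
As a rough illustration of that definition, the sketch below treats a non-text bytes object as a finite sequence of symbols drawn from 0 to 255. The sample value and the names are illustrative only.

```python
data = b"\x01\x06\x13\x00"    # not text, but still a sequence of symbols

symbols = list(data)
print(symbols)                # [1, 6, 19, 0]
print(len(data))              # 4
print(data[1:3])              # b'\x06\x13'
print(data + b"\xff")         # b'\x01\x06\x13\x00\xff'

# A symbol outside the alphabet 0..255 is rejected.
out_of_alphabet = False
try:
    bytes([256])
except ValueError:
    out_of_alphabet = True
print(out_of_alphabet)        # True
```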
...
...
I said they were *strings*. Strings are not necessarily text, although
they often are. Formally, a string is a finite sequence of symbols that
are chosen from a set called an alphabet. See:
https://en.wikipedia.org/wiki/String_%28computer_science%29
A finite sequence of symbols... you mean like a list of integers
within the range [0, 255]? Nothing in that formal definition says that
a "string" of anything other than characters should be meaningfully
treated as text.
Sure. If your bytes don't represent text, then methods like upper() 
probably won't do anything meaningful. It's still a string though.
...
...
...
I don't think it's necessary to be too adamant about "must be some
sort of thing-we-call-string" here. Let practicality rule, since
purity has already waved a white flag at us.
It is because of *practicality* that we should prefer that things that
look similar should be similar. Code is read far more often than it is
written, and if you read two pieces of code that look similar, we should
strongly prefer that they should actually be similar.
And you have yet to prove that this similarity is actually a thing.
I'm not sure the onus is on me to prove this. "Status quo wins a 
stalemate." And surely the onus is on those proposing the new syntax to 
demonstrate that it will be fine to use string delimiters as function 
calls.

You could make a good start by finding other languages, reasonably 
conventional languages with syntax based on the Algol or C tradition, 
that use quotes '' or "" to return arbitrary types.

Even languages with unconventional syntax like Forth or APL would be a 
good start.

Maybe I'm wrong. Maybe quotation marks are widely used for purposes 
other than delimiting strings, and I'm just too ignorant of other 
languages to know it. Maybe Python is in the minority here.

Anyway, the bottom line is this:

I have no objection to using prefixed quotes to represent Unicode 
strings, or byte strings, or Andrew's hypothetical UTF-16 strings, or 
EBCDIC strings, or TRON strings.

https://en.wikipedia.org/wiki/TRON_(encoding)

But I think that any API that would allow z"..." to represent (let's 
say) a socket, or a float, or an HTTP_Server instance, or a list, would 
be a deeply flawed API.
...
Let's look at regular expressions. JavaScript has a syntax for them
involving leading and trailing slashes, borrowed from Perl, but I
can't figure out whether a regex is a first-class object in Perl. So
you can do something like this:
...
findme = /spa*m/
"This has spaaaaaam in it".match(findme)
[ 'spaaaaaam', index: 9, input: 'This has spaaaaaam in it' ]
In Python, I can do the exact same thing, only using double quotes as
the delimiter.
...
...
...
re.search("spa*m", "This has spaaaaam in it")

Sure. As a convenience, the re module has functions which accept 
regular expression patterns as well as compiled regular expression 
objects.
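
A short sketch of that convenience, reusing the pattern and text from the example above: re.search() takes either a pattern string or an already compiled object, and the compiled object also has its own search() method.

```python
import re

text = "This has spaaaaam in it"

# Module-level function given a plain pattern string.
m1 = re.search("spa*m", text)

# The same function given a compiled regular expression object.
rx = re.compile("spa*m")
m2 = re.search(rx, text)

# The compiled object's own method.
m3 = rx.search(text)

print(m1.group() == m2.group() == m3.group())   # True
print(m1.start())                               # 9
```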
...
So what do you mean by "non-string" exactly? In what way is a regular
expression "not a string",
That question is ambiguous. Are you asking about regular expression 
patterns, or regular expression objects?

Regular expression *patterns* are clearly strings:

    pattern = r'...'

We type them with string delimiters; if you call type(pattern) it will 
return str; and you can slice the pattern or uppercase it.

Regular expression *objects* are just as clearly not strings:

    rx = re.compile(pattern)

You can't slice them, they aren't sequences of symbols, they have 
attributes like rx.flags which have no meaning in a string, they lack 
string methods like upper, and those methods they have operate very 
differently from their equivalent string methods:

    pattern.find("X")  # search pattern for "X"
    rx.search("X")  # search "X" for pattern

Regex objects are far more than just the regex pattern.
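
The differences listed above can be checked directly. This is only a sketch: the pattern "spa*m" is borrowed from earlier in the thread as a stand-in, not the elided pattern shown above.

```python
import re

pattern = r"spa*m"           # stand-in pattern, chosen for illustration
rx = re.compile(pattern)

# The pattern behaves like any other str.
print(type(pattern) is str)   # True
print(pattern[:3])            # spa
print(pattern.upper())        # SPA*M

# The compiled object does not.
print(isinstance(rx, str))    # False
print(hasattr(rx, "upper"))   # False
sliceable = True
try:
    rx[:3]
except TypeError:
    sliceable = False
print(sliceable)              # False

# Similar-looking methods point in opposite directions.
print(pattern.find("X"))      # -1: "X" does not occur in the pattern
print(rx.search("X"))         # None: the pattern does not match "X"
```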
...
yet the byte-encoded form of an integer somehow is?
If your bytes represent an integer, then uppercasing them isn't 
meaningful. If your bytes represent ASCII text then uppercasing them 
may be meaningful.
...
It makes absolutely no sense to uppercase an integer, yet
you could uppercase a regex (since all regex special characters are
non-letters, this will make it match uppercase strings).
In general, you can't uppercase regex patterns without radically 
changing the meaning of them. Consider r'\d' and r'\D'.
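
A quick check of that example, as a sketch with illustrative names: uppercasing r'\d+' yields r'\D+', which matches exactly the characters the original rejects.

```python
import re

pattern = r"\d+"              # one or more digits
shouted = pattern.upper()     # becomes r"\D+": one or more NON-digits

print(shouted)                                   # \D+
print(re.fullmatch(pattern, "123") is not None)  # True
print(re.fullmatch(shouted, "123") is not None)  # False
print(re.fullmatch(shouted, "abc") is not None)  # True
```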
...
Yet when you
encode the string as bytes, it gains an upper() method, and when you
encode a regex as a compiled regex object, it loses one. Why do you
insist that a regex is somehow not a string, but b"\xe7\x61" is?
Because a byte-string matches the definition of strings, while compiled 
regex objects do not.

-- 
Steven

Steven D'Aprano