Mailman 3 Re: Idea: Tagged strings in python - Python-ideas


Re: Idea: Tagged strings in python


Brendan Barnwell

Dec. 19, 2022

8:09 p.m.

Sorry, accidentally replied off-list. . . On 2022-12-19 11:36, Chris Angelico wrote:

...

What it means for me for something to "be an HTML string" (or more precisely, to be an instance of HTMLString or whatever the class name is) is for it to be a string that has an extra tag attached to the object that means "this is HTML". That's it. You can make an HTML string that contains utter gobbledegook if you want. Of course, some operations may fail (like if it has a .validate() method) but that doesn't mean it's not still an instance of that class. Or, if you do want that, you can override the slicing method to raise an error if the result isn't valid HTML. The point is that overrides are for specifying the *new* behavior of the subclass (i.e., not allowing certain slice operations); you shouldn't have to override methods just to retain the superclass behavior. I mean, we were talking about this in the context of syntax highlighting. The utility of HTML-string highlighting would be seriously reduced if only *valid* HTML could be in an HTML string.

...

I can't run that example myself as I don't have Python 3.11 set up. But just from what you showed, I don't find it convincing. Enums are special in that they are specifically designed to allow only a fixed set of values. I see that as the uncommon case, rather than the common one of subclassing an "open-ended" class to create a new "open-ended" class (i.e., one that does not pre-specify exactly which values are allowed). So no, I don't think it would be more irritating. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown


Chris Angelico

December 2022

9:59 p.m.

New subject: Idea: Tagged strings in python

On Tue, 20 Dec 2022 at 07:13, Brendan Barnwell <brenbarn@brenbarn.net> wrote:

...

The enum module was added in Python 3.4. Nonetheless, a StrEnum is absolutely a str, and whatever you say about an HTML string has to also be valid for a StrEnum, or else the inverse is. The way things are, a StrEnum or an HTML string will behave *exactly as a string does*. The alternative is that, if any new operations are added to strings in the future, they have to be explicitly blocked by StrEnum or else they will randomly and mysteriously misbehave - or, at very best, crash with unexpected errors. Which one is more hostile to subclasses? ChrisA
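A minimal sketch of the "behaves exactly as a string does" point. It uses a str-mixin Enum rather than 3.11's StrEnum so that it runs on older interpreters; the Color class and its members are made up for illustration.

```python
from enum import Enum


class Color(str, Enum):  # hypothetical example enum, not from the thread
    RED = "red"
    GREEN = "green"


member = Color.RED
shouted = member.upper()   # inherited str method, unaware of the enum
joined = member + "-ish"   # inherited str operator, likewise

print(isinstance(member, str))
print(type(shouted).__name__, shouted)
print(type(joined).__name__, joined)
```

Neither result is a Color member, so the enum never has to decide what an "uppercased RED" would mean.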

Ethan Furman

1:53 a.m.

New subject: Idea: Tagged strings in python

On 12/19/22 13:59, Chris Angelico wrote:

...

As Brendan noted, mixed-type enums are special -- they are meant to be whatever they subclass, with a couple extra features/restrictions. Personally, every other time I've wanted to subclass a built-in data type, I've wanted the built-in methods to return my subclass, not the original class. All of which is to say: sometimes you want it one way, sometimes the other. ;-) Metaclasses, anyone? -- ~Ethan~

Chris Angelico

2:12 a.m.

New subject: Idea: Tagged strings in python

On Tue, 20 Dec 2022 at 12:55, Ethan Furman <ethan@stoneleaf.us> wrote:

...

On 12/19/22 13:59, Chris Angelico wrote:

...
The way things are, a StrEnum or an HTML string will behave *exactly as a string does*. The alternative is that, if any new operations are added to strings in the future, they have to be explicitly blocked by StrEnum or else they will randomly and mysteriously misbehave - or, at very best, crash with unexpected errors. Which one is more hostile to subclasses?

As Brendan noted, mixed-type enums are special -- they are meant to be whatever they subclass, with a couple extra features/restrictions.

Fair, but defaultdict also exhibits this behaviour, so maybe there are a number of special cases. Or, as Syndrome put it: "When everyone's [special]... no one will be."

...

Personally, every other time I've wanted to subclass a built-in data type, I've wanted the built-in methods to return my subclass, not the original class.

All of which is to say: sometimes you want it one way, sometimes the other. ;-)

Yep, sometimes each way. So the real question is not "would the opposite decision make sense in some situations?" but "which one is less of a problem when it's the wrong decision?". And I put it to you that returning an instance of the base type is less of a problem, in the same way that *any other* operation unaware of the subclass would behave.

    def underline(head):
        """Build an underline line for the given heading"""
        return "=" * len(head)

Would you expect underline() to return the same type as head, or a plain str? Would this be true of every single function that returns something of the same kind as one of its parameters?
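To make the comparison concrete, here is a small sketch that runs underline() next to a couple of inherited str methods. Heading is a hypothetical subclass with no overrides.

```python
class Heading(str):
    """Hypothetical str subclass that overrides nothing."""


def underline(head):
    """Build an underline line for the given heading"""
    return "=" * len(head)


title = Heading("Tagged strings")
line = underline(title)

# The free function and the inherited methods agree: all give plain str.
for label, value in [
    ("underline(title)", line),
    ("title.upper()", title.upper()),
    ("title[:6]", title[:6]),
]:
    print(label, "->", type(value).__name__)
```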

...

Metaclasses, anyone?

Hmm, how would they help? I do think that metaprogramming could help here, but not sure about metaclasses specifically. If I wanted to automate this, I'd go for something like this:

    @autospecialize
    class Str(str):
        def extra_method(self): ...

where the autospecialize decorator would look at your class's first base class, figure out which methods should get this treatment (only if not overridden, only if they return that type, not __new__, maybe other rules), and then add a wrapper that returns __class__(self).

But people will dispute parts of that. Maybe it should be explicitly told which base class to handle this way. Maybe it'd be better to have an intermediate class, rather than mutating the subclass. Maybe you should be explicit about which methods get autospecialized. It's not an easy problem, and simply returning the base class is the one option that you can be confident of.

ChrisA
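One possible shape for such a decorator, as a rough sketch rather than a settled design. Every rule here is an assumption made for illustration: it wraps only public methods plus a short hand-picked operator list (`_OPERATORS`), it re-wraps only results whose type is exactly the base class, and it assumes the subclass constructor accepts a single base-type value.

```python
import functools
import types

# Assumption: only C-level method descriptors on the base are candidates.
_DESCRIPTOR_TYPES = (types.MethodDescriptorType, types.WrapperDescriptorType)
# Assumption: a hand-picked operator list; which ones belong here is debatable.
_OPERATORS = {"__add__", "__mul__", "__rmul__", "__mod__", "__getitem__"}


def _wrap(cls, base, method):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        # Re-wrap exact base instances only; ints, lists, bytes pass through.
        return cls(result) if type(result) is base else result
    return wrapper


def autospecialize(cls):
    """Hypothetical decorator: inherited methods return cls instead of the base."""
    base = cls.__bases__[0]
    for name, attr in vars(base).items():
        if name in vars(cls):
            continue  # overridden by the subclass, leave it alone
        if name.startswith("_") and name not in _OPERATORS:
            continue  # skip __new__, __repr__, __hash__, ...
        if isinstance(attr, _DESCRIPTOR_TYPES):
            setattr(cls, name, _wrap(cls, base, attr))
    return cls


@autospecialize
class Str(str):
    def extra_method(self):
        return len(self)


s = Str("hello")
print(type(s.upper()).__name__, type(s + " world").__name__, type(s[1:3]).__name__)
print(type(s.split("l")[0]).__name__)  # containers are not looked into
```

Note that methods returning containers, such as split(), still hand back plain str elements, which is one of the "other rules" a real version would have to settle.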

Steven D'Aprano

9:16 a.m.

New subject: Idea: Tagged strings in python

On Mon, Dec 19, 2022 at 05:53:38PM -0800, Ethan Furman wrote:

...

Personally, every other time I've wanted to subclass a built-in data type, I've wanted the built-in methods to return my subclass, not the original class.

Enums are special. But outside of enums, I cannot think of any useful situation where the desirable behaviour is for methods on a subclass to generally return a superclass rather than the type of self. It's normal behaviour for operations on a class K to return K instances, not some superclass of K. I dare say there are a few, but they don't come to mind.

...

All of which is to say: sometimes you want it one way, sometimes the other. ;-)

Yes, but one way is *overwhelmingly* more common than the other. Builtins make the rare form easy and the common form hard.

...

Metaclasses, anyone?

Oh gods, we shouldn't need to write a metaclass just to get methods that create instances of the calling class instead of one of its superclasses. -- Steve

Chris Angelico

9:52 a.m.

New subject: Idea: Tagged strings in python

On Tue, 20 Dec 2022 at 20:20, Steven D'Aprano <steve@pearwood.info> wrote:

...

How should it do that, if the constructor for K has a different signature from the constructor for K's superclass that is providing the method? How is the superclass to know how to return a K? Should the vanilla dict.__or__ method be able to take two defaultdicts and return a defaultdict, or is it reasonable to demand that, in this situation, defaultdict needs to define the method itself?
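A small sketch of the constructor-signature problem. LoudDict and rebuild_like are hypothetical names; rebuild_like stands in for a base-class method that naively tries to build "the caller's type" from its result.

```python
from collections import defaultdict


class LoudDict(dict):
    """Hypothetical subclass that keeps dict's constructor signature."""


def rebuild_like(original, data):
    """Naive strategy: assume type(original)(data) is a valid call."""
    return type(original)(data)


merged = {"x": 1, "y": 2}

# Works when the subclass constructor matches the base class.
same_signature = rebuild_like(LoudDict(x=1), merged)

# Fails for defaultdict, whose first positional argument is the factory.
try:
    rebuild_like(defaultdict(list), merged)
except TypeError as exc:
    failure = str(exc)
else:
    failure = None

print(type(same_signature).__name__)
print("defaultdict rebuild failed:", failure)
```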

...

I'm not sure how dict.__or__ would be expected to cope with this situation. Yes, I'm sure it would be convenient. It would also have some extremely annoying consequences. ChrisA

Cameron Simpson

10:42 p.m.

New subject: Idea: Tagged strings in python

On 20Dec2022 20:16, Steven D'Aprano <steve@pearwood.info> wrote:

...

With str subtypes, the case that comes to my mind is mixing str subtypes. I happen to be wallowing in Django admin forms at the moment, and they have a mark_safe(some_html_here) function, which seems to return a str subtype (I infer - it's what I would be doing) - this is used in the templating engine to know that it _doesn't_ need to escape markup punctuation at render time. The default is escaping, to avoid accidental injection. So... I'd want an "html" str to support, say, addition to construct longer strings. html(str)+str should make a new html str with the plain str escaped. html(str)+css(str) should raise a TypeError. Etc etc. html(str).upper() might uppercase only the bits outside the tags i.e. "<b>foo</b>" -> "<b>FOO</b>". So, yes, for many methods I might reasonably expect a new html(str). But I can contrive situations where I'd want a plain str, and I'd be leery of "every method returns html(str)" by default - because such a string has substructure that seems to warrant careful thought. Cheers, Cameron Simpson <cs@cskk.id.au>
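The addition rules described above could be sketched roughly as follows. Html and Css are hypothetical classes written for this illustration (this is not Django's implementation), and html.escape from the standard library stands in for whatever escaping a real template engine would apply.

```python
from html import escape


class Css(str):
    """Hypothetical tagged string for CSS."""


class Html(str):
    """Hypothetical tagged string for markup that is already safe."""

    def __add__(self, other):
        if isinstance(other, Html):
            return Html(str.__add__(self, other))
        if type(other) is str:
            # Plain text is escaped before it joins the markup.
            return Html(str.__add__(self, escape(other)))
        if isinstance(other, str):
            # Some other tagged subtype: refuse to mix.
            raise TypeError(f"cannot add {type(other).__name__} to Html")
        return NotImplemented


combined = Html("<b>hi</b>") + " & bye"
print(type(combined).__name__, combined)

try:
    Html("<p>") + Css("p { color: red }")
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None
print("mixing error:", mixing_error)
```

Only `__add__` is handled here; every other inherited method still returns a plain str, which is the "careful thought per method" point.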

Steven D'Aprano

6 a.m.

New subject: Idea: Tagged strings in python

On Wed, Dec 21, 2022 at 09:42:51AM +1100, Cameron Simpson wrote:

...

The key word there is *contrive*.

Obviously there are methods that are expected to return plain old strings. If you have a html.extract_content() method which extracts the body of the html document as plain text, stripping out all markup, there is no point returning a html object and a str will do. But most methods will need to keep the markup, and so they will need to return a html object.

HTML is probably not the greatest example for this issue, because I expect that a full-blown HTML string subclass would probably have to override nearly all methods, so in this *specific* case the status quo is probably fine in practice. The status quo mostly hurts *lightweight* subclasses:

    class TurkishString(str):
        def upper(self):
            return TurkishString(str.upper(self.replace('i', 'İ')))
        def lower(self):
            return TurkishString(str.lower(self.replace('I', 'ı')))

That's fine so long as the *only* operations you do to a TurkishString is upper or lower. As soon as you do concatenation, substring replacement, stripping, joining, etc you get a regular string.

So we've gone from a lightweight subclass that needs to override two methods, to a heavyweight subclass that needs to override 30+ methods.

This is probably why we don't rely on subclassing that much. Easier to just write a top-level function and forget about subclassing.

-- Steve
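A short run of the TurkishString class shows where the tag is lost. The sample word is an arbitrary choice for illustration.

```python
class TurkishString(str):
    def upper(self):
        return TurkishString(str.upper(self.replace('i', 'İ')))

    def lower(self):
        return TurkishString(str.lower(self.replace('I', 'ı')))


word = TurkishString("diyarında")

direct = word.upper()                 # overridden method: Turkish rules apply
after_concat = (word + "!").upper()   # word + "!" is already a plain str

print(type(direct).__name__, direct)
print(type(word + "!").__name__, after_concat)
print(type(word.strip()).__name__)
```

After a single concatenation the Turkish rules silently stop applying, because the intermediate value is an ordinary str.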

David Mertz, Ph.D.

6:18 a.m.

New subject: Idea: Tagged strings in python

I'm on my tablet, so cannot test at the moment. But is `str.upper()` REALLY wrong about the Turkish dotless I (and dotted capital I) currently?! That feels like a BPO needed if true. On Wed, Dec 21, 2022, 1:04 AM Steven D'Aprano <steve@pearwood.info> wrote:

...

On Wed, Dec 21, 2022 at 09:42:51AM +1100, Cameron Simpson wrote:

...
With str subtypes, the case that comes to my mind is mixing str subtypes. [...] So, yes, for many methods I might reasonably expect a new html(str). But I can contrive situations where I'd want a plain str

The key word there is *contrive*.

Obviously there are methods that are expected to return plain old strings. If you have a html.extract_content() method which extracts the body of the html document as plain text, stripping out all markup, there is no point returning a html object and a str will do. But most methods will need to keep the markup, and so they will need to return a html object.

HTML is probably not the greatest example for this issue, because I expect that a full-blown HTML string subclass would probably have to override nearly all methods, so in this *specific* case the status quo is probably fine in practice. The status quo mostly hurts *lightweight* subclasses:

    class TurkishString(str):
        def upper(self):
            return TurkishString(str.upper(self.replace('i', 'İ')))
        def lower(self):
            return TurkishString(str.lower(self.replace('I', 'ı')))

That's fine so long as the *only* operations you do to a TurkishString is upper or lower. As soon as you do concatenation, substring replacement, stripping, joining, etc you get a regular string.

So we've gone from a lightweight subclass that needs to override two methods, to a heavyweight subclass that needs to override 30+ methods.

This is probably why we don't rely on subclassing that much. Easier to just write a top-level function and forget about subclassing.

-- Steve

Chris Angelico

6:23 a.m.

New subject: Idea: Tagged strings in python

On Wed, 21 Dec 2022 at 17:20, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

...

I'm on my tablet, so cannot test at the moment. But is `str.upper()` REALLY wrong about the Turkish dotless I (and dotted capital I) currently?!

That feels like a BPO needed if true.

It's wrong about the ASCII i and I, which upper and lower case to each other. There's no way for str.upper() to be told what language it's working with, so it goes with a default that's valid for every language except Turkish and its friends. This also means that lowercasing "İ" will give "i" which uppercases to "I", so it doesn't round-trip. There is no solution other than a language-aware case transformation. ChrisA
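The round-trip failure can be checked directly. One detail worth hedging: in CPython the lowercased "İ" is not a bare "i" but "i" followed by a combining dot above (U+0307), so the sketch compares code points rather than rendered glyphs.

```python
dotted_capital = "\u0130"   # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"    # ı, LATIN SMALL LETTER DOTLESS I

lowered = dotted_capital.lower()
round_tripped = lowered.upper()

print([hex(ord(ch)) for ch in lowered])        # code points after lower()
print([hex(ord(ch)) for ch in round_tripped])  # code points after upper()
print(round_tripped == dotted_capital)

# The ASCII pair maps to each other, whatever the language of the text.
print("i".upper(), "I".lower(), dotless_small.upper())
```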

David Mertz, Ph.D.

6:39 a.m.

New subject: Idea: Tagged strings in python

Oh yeah. Good points! Do we need a PEP for str.upper() to grow an optional 'locale' argument? I feel like there are examples other than the Turkish i's where this matters, but it's past my bedtime, so they aren't coming to mind. Maybe Koine Greek which had not adopted the minuscule/majuscule distinction of post 10th century CE that modern Greek inherited. I feel like `s.upper(locale='koine')` might sensibly account for this.

On Wed, Dec 21, 2022, 1:23 AM Chris Angelico <rosuav@gmail.com> wrote:

...

Chris Angelico

6:52 a.m.

New subject: Idea: Tagged strings in python

On Wed, 21 Dec 2022 at 17:39, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

...

Oh yeah. Good points! Do we need a PEP for str.upper() to grow an optional 'locale' argument? I feel like there are examples other than the Turkish i's where this matters, but it's past my bedtime, so they aren't coming to mind.

I don't think str.upper() is the place for it; Python has a locale module that is a better fit for this. (That's where you'd go if you want to alphabetize strings with proper respect to language, for instance.) But it's a difficult problem. Some languages have different case-folding rules depending on whether you're uppercasing a name or some other word. German needs to know whether something's a noun, because even when lowercased, they have an initial capital letter. The Unicode standard offers a reasonably-generic set of tools, including for case folding. If you feel like delving deep, the standard talks about case conversions in section 3.13 - about a hundred and fifty pages into this document: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf - but as far as I know, that's still locale-agnostic. I think (though I haven't checked) that Python's str.upper/str.lower follow these rules. Anything non-generic would be a gigantic task, not well suited to the core string type, as it would need to be extremely context-sensitive. Anyone who needs that kind of functionality should probably be reaching for the locale module for other reasons anyway, so IMO that would be a better place for a case-conversion toolkit. ChrisA
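As a small illustration of the locale-agnostic tools that str already exposes, the sketch below compares lower(), upper() and casefold() on a German word. The sample word is an arbitrary choice; nothing here consults the locale module.

```python
word = "Straße"

lowered = word.lower()      # ß is already lowercase, so it stays
uppered = word.upper()      # ß has a two-letter uppercase mapping
folded = word.casefold()    # case folding, meant for caseless comparison

print(lowered, uppered, folded)

# Caseless comparison works through casefold(), not through lower().
print(uppered.lower() == lowered)
print(uppered.casefold() == folded)
```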

Stephen J. Turnbull

9:02 a.m.

New subject: Idea: Tagged strings in python

Chris Angelico writes:

...

I don't think str.upper() is the place for it; Python has a locale module that is a better fit for this.

Many would argue that (POSIX) locales aren't a good fit for anything. :-) I agree that it's kind of hard to see anything more complex than a fixed table for the entire Unicode repertoire belonging in str, though. (I admit that my feeling toward Erdogan makes me less sympathetic to the Turks. :-) Use of locale notation in keys for more sophisticated treatment is hard to beat as far as I know, though. Steve

Steven D'Aprano

9:58 a.m.

New subject: Idea: Tagged strings in python

On Fri, Dec 23, 2022 at 06:02:39PM +0900, Stephen J. Turnbull wrote:

...

Many would argue that (POSIX) locales aren't a good fit for anything. :-)

:-)

...

I agree that it's kind of hard to see anything more complex than a fixed table for the entire Unicode repertoire belonging in str, though.

I think for practical reasons, we don't want to overload the builtin str class with excessive complexity. But the string module? Or third-party libraries?

...

(I admit that my feeling toward Erdogan makes me less sympathetic to the Turks. :-)

Does that include the 70% or more Turks who disapprove of Erdoğan? There are at least 35 surviving Turkic languages, including Azerbaijani, Turkmen, Qashqai, Balkan Gagauz, and Tatar. Although Turkish is the single largest of them, it only makes up about 38% of all Turkic speakers. All up, there are about 200 million speakers of Turkic languages. That's more than Germanic languages (excluding English) or Japanese. If any special case should be a special case, it is the Turkish I Problem. But as I said, probably not in the builtin str class. -- Steve

Chris Angelico

11:27 a.m.

New subject: Idea: Tagged strings in python

On Fri, 23 Dec 2022 at 21:02, Steven D'Aprano <steve@pearwood.info> wrote:

...

Not really convinced that it belongs in string, but it could go in unicodedata (if it's lifted straight from the Unicode standards and associated data files), locale, or any sort of third-party library. I think this would be a useful feature to have, although it'll probably end up needing a LOT of information (you can't just say "give me a locale-correct uppercasing of this string" without further context). So IMO it should be third-party. Went looking on PyPI to see what already exists, but I didn't find anything. Might just be that I didn't pick the right keywords to look for though. ChrisA

Cameron Simpson

10:03 p.m.

New subject: Idea: Tagged strings in python

On 23Dec2022 22:27, Chris Angelico <rosuav@gmail.com> wrote:

...

It would probably be good to have a caveat mentioning these context difficulties in the docs of the unicodedata and str/string case fiddling methods. Not a complete exposition, but making it clear that for some languages the rules require context, maybe with a hard-to-implement-correctly example of naive/incorrect use. Cheers, Cameron Simpson <cs@cskk.id.au>

Chris Angelico

10:11 p.m.

New subject: Idea: Tagged strings in python

On Sat, 24 Dec 2022 at 09:07, Cameron Simpson <cs@cskk.id.au> wrote:

...

Do people actually read those warnings? Hang on, lemme pop into the time machine and add one to the docstring and docs for str.upper(). Okay, I'm back. Tell me, have you read the docstring? Do you know exactly what it says? For example, is there wording that clarifies whether x.upper() uppercases the string in-place? (I had to actually check that one myself, as I haven't memorized the docstring either.) ChrisA

Cameron Simpson

2:13 a.m.

New subject: Idea: Tagged strings in python

On 24Dec2022 09:11, Chris Angelico <rosuav@gmail.com> wrote:

...

On Sat, 24 Dec 2022 at 09:07, Cameron Simpson <cs@cskk.id.au> wrote:

...
On 23Dec2022 22:27, Chris Angelico <rosuav@gmail.com> wrote:

...
I think this would be a useful feature to have, although it'll probably end up needing a LOT of information (you can't just say "give me a locale-correct uppercasing of this string" without further context). So IMO it should be third-party.

It would probably be good to have a caveat mentioning these context difficulties in the docs of the unicodedata and str/string case fiddling methods. Not a complete exposition, but making it clear that for some languages the rules require context, maybe with a hard-to-implement-correctly example of naive/incorrect use.

Do people actually read those warnings?

I have read them, I think, though not for a while.

...

Hang on, lemme pop into the time machine and add one to the docstring and docs for str.upper(). Okay, I'm back. Tell me, have you read the docstring?

    Python 3.9.13 (main, Aug 11 2022, 14:01:42)
    [Clang 12.0.0 (clang-1200.0.32.29)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> help(str.upper)
    Help on method_descriptor:

    upper(self, /)
        Return a copy of the string converted to uppercase.

Hmm. Did you commit the change? Is the key to the time machine back on its hook?

Docs:

    str.upper()
        Return a copy of the string with all the cased characters [4] converted to uppercase. Note that s.upper().isupper() might be False if s contains uncased characters or if the Unicode category of the resulting character(s) is not “Lu” (Letter, uppercase), but e.g. “Lt” (Letter, titlecase).

        The uppercasing algorithm used is described in section 3.13 of the Unicode Standard.

and [4] here:

    Cased characters are those with general category property being one of “Lu” (Letter, uppercase), “Ll” (Letter, lowercase), or “Lt” (Letter, titlecase).

...

wording that clarifies whether x.upper() uppercases the string in-place?

Well, it says "a copy", so I'd say it's clear.

I've only got version 5.0 of Unicode here. [steps into the other room...] Thank you, I see you used the time machine to buy me version 9.0 too :-) Ah, 3.13 is 7 pages of compact text here.

I was thinking of something a bit more general, like "case changing is a complex language- and context-dependent process, and str.upper (etc.) therefore performs a simplistic operation".

Cheers, Cameron Simpson <cs@cskk.id.au>

Chris Angelico

3:35 a.m.

New subject: Idea: Tagged strings in python

On Sat, 24 Dec 2022 at 13:15, Cameron Simpson <cs@cskk.id.au> wrote:

...

My question was more: do you know, or do you have to look? I'll take another example. Take the list.index() method, which returns the index where a thing can be found. *Without checking first*, answer these questions: 1) Will identical-but-not-equal values (eg the same instance of nan) be found? 2) Do the docs and/or docstring tell you the answer to question 1? And then a logical followup: 3) If your answer to question 1 was incorrect, {does it help, would it have helped} to have a note in the docs? ChrisA

Cameron Simpson

3:56 a.m.

New subject: Idea: Tagged strings in python

On 24Dec2022 14:35, Chris Angelico <rosuav@gmail.com> wrote:

...

On Sat, 24 Dec 2022 at 13:15, Cameron Simpson <cs@cskk.id.au> wrote: My question was more: do you know, or do you have to look? I'll take another example. Take the list.index() method, which returns the index where a thing can be found. *Without checking first*, answer these questions:

1) Will identical-but-not-equal values (eg the same instance of nan) be found?

I'd say no, because it should need to do an equality check to compare things. Let's see:

    >>> from math import nan
    >>> nan == nan
    False
    >>> L = [nan]
    >>> L.index(nan)
    0

Well, I'm wrong. Assuming it does a precheck with "is", I guess.

...

2) Do the docs and/or docstring tell you the answer to question 1?

[ To the docs!... How disappointing, the "Index" href at top right does not take me directly to list.index :-) ]

help(list.index) seems empty.

The best docs seem to be .index for sequences: https://docs.python.org/3/library/stdtypes.html#index-19 which only allude to the intended semantics:

    index of the first occurrence of x in s (at or after index i and before index j)

I guess that _could_ imply an object identity check, but if I read it that way I might really take it to mean identity ("occurrence of x in s"), rather than the more useful and expected "also equal".

The "nan" test says to me that there's an identity check, which is at least quite fast and maybe significantly useful for types which intern small values like int.

This test says there's a fallback to an equality test:

    >>> L1=[3]
    >>> L2=[3]
    >>> L1 is L2
    False
    >>> L1 == L2
    True
    >>> L1 in [L2]
    True

...

And then a logical followup:

3) If your answer to question 1 was incorrect, {does it help, would it have helped} to have a note in the docs?

It would help to be able to understand the behaviour. I think with `list.index` I'd expect an equality test only (I was surprised by your "nan" example, even though "nan" is a pretty unusual value). Cheers, Cameron Simpson <cs@cskk.id.au>

Chris Angelico

4:12 a.m.

New subject: Idea: Tagged strings in python

On Sat, 24 Dec 2022 at 15:00, Cameron Simpson <cs@cskk.id.au> wrote:

...

On 24Dec2022 14:35, Chris Angelico <rosuav@gmail.com> wrote:

...
On Sat, 24 Dec 2022 at 13:15, Cameron Simpson <cs@cskk.id.au> wrote: My question was more: do you know, or do you have to look? I'll take another example. Take the list.index() method, which returns the index where a thing can be found. *Without checking first*, answer these questions:

1) Will identical-but-not-equal values (eg the same instance of nan) be found?

I'd say no, because it should need to do an equality check to compare things.

Well, I'm wrong. Assuming it does a precheck with "is", I guess.

Indeed; even experts don't always know this. The check used is the "identity-or-equality" check common to a lot of containment checks in Python.
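The "identity-or-equality" check can be modelled in a few lines. identity_or_equality_index is a hypothetical pure-Python stand-in written for this illustration, not the actual list.index implementation.

```python
nan = float("nan")
other_nan = float("nan")   # a second, distinct NaN object
items = [nan]


def identity_or_equality_index(seq, target):
    """Hypothetical model: an element matches if it *is* the target or equals it."""
    for position, element in enumerate(seq):
        if element is target or element == target:
            return position
    raise ValueError("target is not in sequence")


same_object = items.index(nan)   # found through identity, despite nan != nan

try:
    items.index(other_nan)       # neither identical nor equal
except ValueError:
    found_other = False
else:
    found_other = True

print(nan == nan, same_object, nan in items, found_other)
```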

...

...
2) Do the docs and/or docstring tell you the answer to question 1?

[ To the docs!... How disappointing, the "Index" href at top right does not take me directly to list.index :-) ]

...

help(list.index) seems empty.

Huh that's strange. I'm checking this in a few recent versions, and they all say "Return first index of value".

...

The best docs seem to be .index for sequences: https://docs.python.org/3/library/stdtypes.html#index-19 which only allude to the intended semantics:

index of the first occurrence of x in s (at or after index i and before index j)

I guess that _could_ imply an object identity check, but if I read it that way I might really take it to mean identity ("occurrence of x in s"), rather than the more useful and expected "also equal".

The docs are a bit lax about mentioning the identity check (see further up in the same table, where containment is described in terms of equality; in actual fact, nan in [nan] will be True), but this seldom comes up in practice because most objects are equal to themselves.

...

The "nan" test says to me that there's an identity check, which is at least quite fast and maybe significantly useful for types which intern small values like int.

Right. I think that, originally, the identity check was considered to be merely an optimization, and values that compare unequal to themselves were considered weird outliers; it's only more recently that it was decided that the "identity-or-equality" semantic check was worth documenting. (No behavioural change, just a change in the attitude towards "weird values".)

...

...
And then a logical followup:

3) If your answer to question 1 was incorrect, {does it help, would it have helped} to have a note in the docs?

It would help to be able to understand the behaviour. I think with `list.index` I'd expect an equality test only (I was surprised by your "nan" example, even though "nan" is a pretty unusual value).

It might help in those rare instances where you think to go and read the docs, which basically means the times when something looks wrong to you. It almost certainly won't help for cases where someone doesn't recognize a problem. With string uppercasing, anyone who would think to look in the docs for how it handles locale-specific case conversions is already going to understand that it can't possibly do anything more than a generic translation table, so I don't think it'd buy anything to have a note in the docs. ChrisA

Cameron Simpson

5:46 a.m.

New subject: Idea: Tagged strings in python

On 24Dec2022 15:12, Chris Angelico <rosuav@gmail.com> wrote:

...

Ugh. It isn't empty. But on my local system the pager help() invokes seems to use the alternate screen, so when I exit it the help's gone. I really need to debug that, it's incredibly annoying.

...

Often.

...

I'm not so sure. For example, my naive inclination might have been to look to see if it paid attention to, say, the POSIX locale, and blithely assume that such attention might be enough. Though I wonder if acting differently for locales might amount to mojibake sometimes.

Cheers, Cameron Simpson <cs@cskk.id.au>

Steven D'Aprano

7:30 a.m.

New subject: Idea: Tagged strings in python

On Wed, Dec 21, 2022 at 01:18:46AM -0500, David Mertz, Ph.D. wrote:

...

I'm on my tablet, so cannot test at the moment. But is `str.upper()` REALLY wrong about the Turkish dotless I (and dotted capital I) currently?!

It has to be. Turkic languages like Turkish, Azerbaijani and Tatar distinguish dotted and non-dotted I's, leading to a slew of problems infamously known as "The Turkish I problem".

(Other languages use undotted i's but not in the same way, e.g. Irish roadsigns in Gaelic usually drop the dot to avoid confusion with í. And don't confuse the undotted i with the Latin iota ɩ, which is a completely different letter to the Greek iota ι. Alphabets are hard.)

In Turkic languages, we have:

    Letter:      ı    I    i    İ
    -----------  ---  ---  ---  ---
    Lowercase:   ı    ı    i    i
    Uppercase:   I    I    İ    İ

Swapping case can never add or remove a dot. (The technical name for the dot is "tittle".) Which is perfectly logical, of course. But most other people with Latin-based alphabets mix the dotted and dotless letters together, leading to this lossy table:

    Letter:      ı    I    i    İ
    -----------  ---  ---  ---  ---
    Lowercase:   ı    i    i    i
    Uppercase:   I    I    I    İ

which is the official Unicode case conversion, which Python follows.

...

    >>> "ıIiİ".lower()
    'ıiii̇'
    >>> "ıIiİ".upper()
    'IIIİ'

Just to make the Turkish I problem even more exciting, you aren't supposed to use Turkish rules when changing the case of foreign proper nouns. So the popular children's book "Alice Harikalar Diyarında" (Alice in Wonderland) should use *both* sets of rules when uppercasing to give "ALICE HARİKALAR DİYARINDA". Sometimes the dot can be very significant. https://gizmodo.com/a-cellphones-missing-dot-kills-two-people-puts-three-m-3...
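The Turkic table given earlier in this message can be written as top-level functions with str.translate. This is only a sketch: turkic_upper and turkic_lower are hypothetical helpers, and, as the Alice example shows, a pure table cannot tell which words are foreign proper nouns.

```python
# Pre-map the four dotted/dotless letters, then let str do the rest.
_TO_UPPER = str.maketrans({"ı": "I", "i": "İ"})
_TO_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def turkic_upper(text):
    """Hypothetical helper: uppercase using the Turkic dotted/dotless rules."""
    return text.translate(_TO_UPPER).upper()


def turkic_lower(text):
    """Hypothetical helper: lowercase using the Turkic dotted/dotless rules."""
    return text.translate(_TO_LOWER).lower()


print(turkic_upper("ıIiİ"), turkic_lower("ıIiİ"))

# Applying Turkic rules to the whole title dots the I in the foreign name.
title = "alice harikalar diyarında"
print(turkic_upper(title))

# The caller has to know which words are foreign and mix the two rule sets.
print("alice".upper() + " " + turkic_upper("harikalar diyarında"))
```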

...

That feels like a BPO needed if true.

We do whatever the Unicode standard says to do. They say that localisation issues are out of scope for Unicode. -- Steve

Chris Angelico

6:20 a.m.

New subject: Idea: Tagged strings in python

On Wed, 21 Dec 2022 at 17:03, Steven D'Aprano <steve@pearwood.info> wrote:

...

Also not a great example, honestly. Part of the problem is that there *are no good examples*. You need something that subclasses a core data type, does not change its constructor in any way, and needs to always get back another of itself when any method is called. But in every other way, it is its superclass. I think defaultdict comes close, but it changes the constructor's signature; StrEnum and IntEnum come very close, but apparently they're just special cases and can be written off as irrelevant; there really aren't that many situations where this even comes up. Part of the problem is that it's really not clear which methods should return "the same type" and which should return a core data type. Clearly __len__ on a string should return a vanilla integer, regardless of the precise class of string; but should bit_length on an integer return an int of the subclass, or a plain integer? Is it different just because it happens to return the same data type on a vanilla int? What about as_integer_ratio() - should that return a tuple of two of the same type, or should it return (self, 1) ? Remember that you're asking the int type to make these decisions globally. ChrisA
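For reference, this sketch shows what an int subclass gets back today from the methods mentioned. Meters is a hypothetical subclass with no overrides, and as_integer_ratio() on int needs Python 3.8 or later.

```python
class Meters(int):
    """Hypothetical int subclass that overrides nothing."""


m = Meters(10)

results = {
    "m + 1": m + 1,
    "m.bit_length()": m.bit_length(),
    "m.as_integer_ratio()[0]": m.as_integer_ratio()[0],
    "abs(m)": abs(m),
}

# Every result is a plain int, including the numerator that equals m itself.
for label, value in results.items():
    print(label, "->", type(value).__name__, value)
```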

Cameron Simpson

6:31 a.m.

New subject: Idea: Tagged strings in python

On 21Dec2022 17:00, Steven D'Aprano <steve@pearwood.info> wrote:

...

On Wed, Dec 21, 2022 at 09:42:51AM +1100, Cameron Simpson wrote:

...
With str subtypes, the case that comes to my mind is mixing str subtypes. [...] So, yes, for many methods I might reasonably expect a new html(str). But I can contrive situations where I'd want a plain str

The key word there is *contrive*.

Surely. I think my notion is that most of the ad hoc lexical str methods don't know anything about a str-with-special-semantics and therefore may well generally want to return a plain str, so it isn't the disastrous starting point I think you're suggesting. Obviously that's a generalisation.

...

Obviously there are methods that are expected to return plain old strings. If you have a html.extract_content() method which extracts the body of the html document as plain text, stripping out all markup, there is no point returning a html object and a str will do. But most methods will need to keep the markup, and so they will need to return a html object.

Hypothetical. I'm not sure I entirely agree. I think we can both agree there will be methods which _should_ return a str and methods which should return the same type as the source object. How the mix plays out depends on the class.

...

[...] The status quo mostly hurts *lightweight* subclasses:

    class TurkishString(str):
        def upper(self):
            return TurkishString(str.upper(self.replace('i', 'İ')))
        def lower(self):
            return TurkishString(str.lower(self.replace('I', 'ı')))

That's fine so long as the *only* operations you do to a TurkishString are upper or lower. As soon as you do concatenation, substring replacement, stripping, joining, etc you get a regular string.

So we've gone from a lightweight subclass that needs to override two methods, to a heavyweight subclass that needs to override 30+ methods.

I think __getattribute__ may be the go here. There's a calling cost of course, but you could fairly easily write a __getattribute__ which (a) checked for a superclass matching method and (b) special-cased a few methods, and otherwise made all methods return either the same class (TurkishString) or plain str depending on the majority method flavour. In fact, if I were doing this for real I might make a mixin or intermediate class with such a __getattribute__, provided there was a handy TurkishString(str)-like call to promote a plain str back into the subclass. (My personal preference is solidifying to a .promote(anything) method, which is a discussion for elsewhere.)

...

This is probably why we don't rely on subclassing that much. Easier to just write a top-level function and forget about subclassing.

Oooh, I do a _lot_ of subclassing :-) Cheers, Cameron Simpson <cs@cskk.id.au>

Rob Cliffe

12:31 a.m.

New subject: Idea: Tagged strings in python

On 20/12/2022 09:16, Steven D'Aprano wrote:

...

Caveat: If you were subclassing str, you would probably want __str__ and __repr__ (if you were not overriding them) to return plain strings. Best wishes Rob Cliffe

Brendan Barnwell

2:55 a.m.

New subject: Idea: Tagged strings in python

On 2022-12-19 13:59, Chris Angelico wrote:

...

Your example used StrEnum, which was added in Python 3.11.

...

No, it doesn't, because HTMLString and StrEnum can be different subclasses of str with different behavior. You seem to be missing the concept of subclasses here. Yes, a StrEnum may be an instance of str, and an HTMLString may also be an instance of str. But that does not mean the behavior of both needs to be same. They are instances of *different subclasses* of str and can have *different behavior*. An instance of collections.Counter is an instance of dict and so is an instance of collections.defaultdict, but that doesn't mean that anything I say about a Counter has to be valid for a defaultdict. One way that some libraries implement this for their own classes is to have an attribute or method called something like `_class` or `_constructor` that specifies which class to use to construct a new instance when needed. By default such a class may return an instance of the same type as self (i.e., the most specific subclass), but subclasses could override it to do something else.
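
A minimal sketch of that `_constructor` pattern on an ordinary (non-builtin) class. The `Vector`, `TaggedVector` and `UnitVector` classes are invented for the example; only the attribute name comes from the description above.

```python
class Vector:
    def __init__(self, *components):
        self.components = tuple(components)

    @property
    def _constructor(self):
        # Default: build results with the most specific subclass of self.
        return type(self)

    def scaled(self, factor):
        return self._constructor(*(c * factor for c in self.components))


class TaggedVector(Vector):
    """Lightweight subclass: inherits the default and gets itself back."""


class UnitVector(Vector):
    """Scaling breaks the unit-length property, so fall back to Vector."""

    @property
    def _constructor(self):
        return Vector


print(type(TaggedVector(1, 2).scaled(2)).__name__)  # TaggedVector
print(type(UnitVector(1, 0).scaled(3)).__name__)    # Vector
```

The lightweight subclass needs no overrides at all; only the unusual one has to say something.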

...

I already answered that in my previous post. To repeat: StrEnum is the unusual case and I am fine with it being more difficult to create something like StrEnum, because that is not as important as making it easy to create classes that *do* return an instance of themselves (i.e., an instance of the same type as "self") from their various methods. The current behavior is more hostile to subclasses because people typically write subclasses to *extend* the behavior of superclasses, and that is hindered if you have to override every superclass method just to make it do the same thing but return the result wrapped in the new subclass. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown

Chris Angelico

3:15 a.m.

New subject: Idea: Tagged strings in python

On Tue, 20 Dec 2022 at 13:56, Brendan Barnwell <brenbarn@brenbarn.net> wrote:

...

Oh! My apologies. The older way of spelling it uses multiple inheritance, but it comes to the same thing; it's still very definitely a string. StrEnum is a lot more convenient, and I've been using 3.11 for long enough that I forgot when it came in. Even back in 3.5 (the oldest docs that I have handy), the notion of enum MI was listed as a recommended method: https://docs.python.org/3.5/library/enum.html#others

...

Other than that change to the signature, the demonstration behaves exactly the same (I just tested it on 3.5). Again, my apologies for unintentionally providing an example that works only on very new Pythons.

...

That is very true, but whenever the subclass is NOT the same as the superclass, you provide functionality to do so. Otherwise, the normal assumption should be that it behaves identically. For instance, if you iterate over a Counter, you would expect to get all of the keys in it; it's true that you can subscript it with any value and get back a zero, but the default behaviour of Counter iteration is to do the same thing that a dict would. And that's what we generally see. A StrEnum is a str, and any behaviours that aren't the same as str are provided by StrEnum (for instance, it has a different __repr__). But for anything that isn't overridden - including any new functionality, if you upgrade Python and keep the same StrEnum code - you get the superclass's behaviour.
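
A short sketch of the older multiple-inheritance spelling, to show the "is a str except where overridden" behaviour. The `Colour` class is made up for the example; it should run on any Python with the enum module.

```python
from enum import Enum


class Colour(str, Enum):
    """A string-valued enum spelled with multiple inheritance."""
    RED = "red"
    GREEN = "green"


c = Colour.RED
print(isinstance(c, str))        # True
print(c == "red")                # True
print(repr(c))                   # <Colour.RED: 'red'>
print(c.upper())                 # RED
print(type(c.upper()).__name__)  # str
print(type(c + "dish").__name__) # str
```

Only `__repr__` (and friends) differ; every inherited str method behaves as it would on a plain string and hands back a plain string.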

...

I'm of the opinion that this is a lot less special than you might think, since there are quite a lot of these sorts of special cases.

...

Maybe, but I would say that the solution is to make an easier way to make a subclass that automatically does those changes - not to make this the behaviour of all classes, everywhere. Your idea to:

...

... have a _class attribute may be a good way to do this, since - unless otherwise overridden - it would remain where it is. (Though, minor bikeshedding - a dunder name is probably more appropriate here.) It could even be done with a mixin:

    class Str(autospecialize, str):
        __autospecialize__ = __class__
        def some_method(self): ...

and then the autospecialize class can handle this. There are many ways of handling this, and IMO the best *default* is to not do this.

ChrisA
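
One possible way such a mixin could work, as a sketch only. Everything here is an assumption layered on the idea above: the `AutoSpecialize` name, the short `_wrapped_names` list, and the choice to set `__autospecialize__` automatically in `__init_subclass__` (the class cannot name itself inside its own body, so the mixin fills it in).

```python
def _rewrap(method):
    """Wrap a str method so exact-str results become the specialised class."""
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        # NotImplemented, ints, lists and existing subclass instances pass through.
        if type(result) is str:
            return type(self).__autospecialize__(result)
        return result
    return wrapper


class AutoSpecialize:
    """Hypothetical mixin; wraps a few inherited methods at class creation."""

    # Illustrative, deliberately incomplete list of methods to wrap.
    _wrapped_names = ("__add__", "__getitem__", "replace", "strip", "title")

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if hasattr(cls, "__autospecialize__"):
            return  # inherited or set explicitly: it stays where it is
        cls.__autospecialize__ = cls
        for name in cls._wrapped_names:
            if name not in vars(cls):
                setattr(cls, name, _rewrap(getattr(cls, name)))


class Str(AutoSpecialize, str):
    pass


class SubStr(Str):
    """Inherits __autospecialize__, so results remain Str, not SubStr."""


print(type(Str("abc") + "d").__name__)      # Str
print(type(Str("hello")[1:3]).__name__)     # Str
print(type(Str("abc").upper()).__name__)    # str (upper is not in the list)
print(type(SubStr("abc") + "d").__name__)   # Str
```

Because the dunder methods are set on the class itself, operators and slicing are covered here, at the cost of having to list every method by name.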

Stephen J. Turnbull

5:19 a.m.

New subject: Idea: Tagged strings in python

Brendan Barnwell writes:

...

I don't like tags that lie. Seems pointless (see below).

...

Do you mean "retain the subclass behavior" here? AFAICS what's being called "hostile" is precisely retaining *superclass* behavior.

...

The proposed HTMLstring *class* is irrelevant to syntax highlighting, regardless of its functionality. The OP (and his syntax-highlighting text editor!) wants standard literal syntax *in source code* that allows an editor-that-is-not-as-programmable-as-emacs-or-vim to recognize a fragment of text (typically in a literal string) that is supposed to be highlighted as HTML. Syntax highlighting is not aided by an HTMLstring object in the *running Python program*.

I really don't understand what value your HTMLstring as str + tag provides to the OP, or to a Python program. I guess that an editor written in Python could manipulate a list of TaggedString objects, but this is a pretty impoverished model. Emacsen have had extents/overlays since 1990 or so, which can be nested or overlap, and nesting and overlapping are both needed for source code highlighting.[1][2]

I don't take a position on the "builtins are hostile to subclassing" debate. I can't recall ever noticing the problem, so I'll let you all handle that. :-)

Footnotes:
[1] In Emacsen, tagged source text (overlays) is used not only for syntax highlighting which presumably is nested (but TagSoup HTML!), but also to implement things like hiding text, which is an operation on raw text that need not respect any syntax.
[2] XEmacs's implementation of syntax highlighting actually works in terms of "extent fragments" which are non-overlapping, but they're horrible to work with from an editor API standpoint. They're used only in the implementation of the GUI display, for performance reasons, and each one typically contains a plethora of tags.

Chris Angelico

December 2022

9:59 p.m.

New subject: Idea: Tagged strings in python

On Tue, 20 Dec 2022 at 07:13, Brendan Barnwell <brenbarn@brenbarn.net> wrote:

...

Ethan Furman

1:53 a.m.

New subject: Idea: Tagged strings in python

On 12/19/22 13:59, Chris Angelico wrote:

...

Chris Angelico

2:12 a.m.

New subject: Idea: Tagged strings in python

On Tue, 20 Dec 2022 at 12:55, Ethan Furman <ethan@stoneleaf.us> wrote:

...

On 12/19/22 13:59, Chris Angelico wrote:

...
The way things are, a StrEnum or an HTML string will behave *exactly as a string does*. The alternative is that, if any new operations are added to strings in the future, they have to be explicitly blocked by StrEnum or else they will randomly and mysteriously misbehave - or, at very best, crash with unexpected errors. Which one is more hostile to subclasses?

As Brendan noted, mixed-type enums are special -- they are meant to be whatever they subclass, with a couple extra features/restrictions.

Fair, but defaultdict also exhibits this behaviour, so maybe there are a number of special cases. Or, as Syndrome put it: "When everyone's [special]... no one will be."

...

Personally, every other time I've wanted to subclass a built-in data type, I've wanted the built-in methods to return my subclass, not the original class.

All of which is to say: sometimes you want it one way, sometimes the other. ;-)

...

Metaclasses, anyone?

Steven D'Aprano

9:16 a.m.

New subject: Idea: Tagged strings in python

On Mon, Dec 19, 2022 at 05:53:38PM -0800, Ethan Furman wrote:

...

Personally, every other time I've wanted to subclass a built-in data type, I've wanted the built-in methods to return my subclass, not the original class.

...

All of which is to say: sometimes you want it one way, sometimes the other. ;-)

Yes, but one way is *overwhelmingly* more common than the other. Builtins make the rare form easy and the common form hard.

...

Metaclasses, anyone?

Oh gods, we shouldn't need to write a metaclass just to get methods that create instances of the calling class instead of one of its superclasses. -- Steve

Chris Angelico

9:52 a.m.

New subject: Idea: Tagged strings in python

On Tue, 20 Dec 2022 at 20:20, Steven D'Aprano <steve@pearwood.info> wrote:

...

I'm not sure how dict.__or__ would be expected to cope with this situation. Yes, I'm sure it would be convenient. It would also have some extremely annoying consequences. ChrisA

Cameron Simpson

10:42 p.m.

New subject: Idea: Tagged strings in python

On 20Dec2022 20:16, Steven D'Aprano <steve@pearwood.info> wrote:

...

Steven D'Aprano

December 2022

6 a.m.

New subject: Idea: Tagged strings in python

On Wed, Dec 21, 2022 at 09:42:51AM +1100, Cameron Simpson wrote:

...

David Mertz, Ph.D.

6:18 a.m.

New subject: Idea: Tagged strings in python

...

On Wed, Dec 21, 2022 at 09:42:51AM +1100, Cameron Simpson wrote:

...
With str subtypes, the case that comes to my mind is mixing str subtypes. [...] So, yes, for many methods I might reasonably expect a new html(str). But I can contrive situations where I'd want a plain str

The key word there is *contrive*.

Obviously there are methods that are expected to return plain old strings. If you have a html.extract_content() method which extracts the body of the html document as plain text, stripping out all markup, there is no point returning a html object and a str will do. But most methods will need to keep the markup, and so they will need to return a html object.

HTML is probably not the greatest example for this issue, because I expect that a full-blown HTML string subclass would probably have to override nearly all methods, so in this *specific* case the status quo is probably fine in practice. The status quo mostly hurts *lightweight* subclasses:

    class TurkishString(str):
        def upper(self):
            return TurkishString(str.upper(self.replace('i', 'İ')))
        def lower(self):
            return TurkishString(str.lower(self.replace('I', 'ı')))

That's fine so long as the *only* operations you do to a TurkishString are upper or lower. As soon as you do concatenation, substring replacement, stripping, joining, etc you get a regular string.

So we've gone from a lightweight subclass that needs to override two methods, to a heavyweight subclass that needs to override 30+ methods.
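
A quick demonstration of that loss of type, using the same two-method class. The sample strings are arbitrary.

```python
class TurkishString(str):
    def upper(self):
        return TurkishString(str.upper(self.replace("i", "İ")))

    def lower(self):
        return TurkishString(str.lower(self.replace("I", "ı")))


city = TurkishString("istanbul")
print(type(city.upper()).__name__)   # TurkishString

# Any inherited operation silently drops back to plain str ...
joined = city + " city"
print(type(joined).__name__)         # str
print(type(city.strip()).__name__)   # str
print(type(city[:3]).__name__)       # str

# ... so the Turkish rules are gone on the next call.
print(joined.upper())                # ISTANBUL CITY
```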

This is probably why we don't rely on subclassing that much. Easier to just write a top-level function and forget about subclassing.

-- Steve
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-leave@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/Q6JQVE...
Code of Conduct: http://python.org/psf/codeofconduct/

Chris Angelico

6:23 a.m.

New subject: Idea: Tagged strings in python

On Wed, 21 Dec 2022 at 17:20, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

...

I'm on my tablet, so cannot test at the moment. But is `str.upper()` REALLY wrong about the Turkish dotless I (and dotted capital I) currently?!

That feels like a BPO needed if true.

David Mertz, Ph.D.

6:39 a.m.

New subject: Idea: Tagged strings in python

...

Chris Angelico

6:52 a.m.

New subject: Idea: Tagged strings in python

On Wed, 21 Dec 2022 at 17:39, David Mertz, Ph.D. <david.mertz@gmail.com> wrote:

...

Oh yeah. Good points! Do we need a PEP for str.upper() to grow an optional 'locale' argument? I feel like there are examples other than the Turkish i's where this matters, but it's past my bedtime, so they aren't coming to mind.

Stephen J. Turnbull

9:02 a.m.

New subject: Idea: Tagged strings in python

Chris Angelico writes:

...

I don't think str.upper() is the place for it; Python has a locale module that is a better fit for this.

Steven D'Aprano

December 2022

9:58 a.m.

New subject: Idea: Tagged strings in python

On Fri, Dec 23, 2022 at 06:02:39PM +0900, Stephen J. Turnbull wrote:

...

Many would argue that (POSIX) locales aren't a good fit for anything. :-)

:-)

...

I agree that it's kind of hard to see anything more complex than a fixed table for the entire Unicode repertoire belonging in str, though.

I think for practical reasons, we don't want to overload the builtin str class with excessive complexity. But the string module? Or third-party libraries?

...

(I admit that my feeling toward Erdogan makes me less sympathetic to the Turks. :-)

Chris Angelico

11:27 a.m.

New subject: Idea: Tagged strings in python

On Fri, 23 Dec 2022 at 21:02, Steven D'Aprano <steve@pearwood.info> wrote:

...

Cameron Simpson

10:03 p.m.

New subject: Idea: Tagged strings in python

On 23Dec2022 22:27, Chris Angelico <rosuav@gmail.com> wrote:

...

Chris Angelico

10:11 p.m.

New subject: Idea: Tagged strings in python

On Sat, 24 Dec 2022 at 09:07, Cameron Simpson <cs@cskk.id.au> wrote:

...

Cameron Simpson

2:13 a.m.

New subject: Idea: Tagged strings in python

On 24Dec2022 09:11, Chris Angelico <rosuav@gmail.com> wrote:

...

On Sat, 24 Dec 2022 at 09:07, Cameron Simpson <cs@cskk.id.au> wrote:

...
On 23Dec2022 22:27, Chris Angelico <rosuav@gmail.com> wrote:

...
I think this would be a useful feature to have, although it'll probably end up needing a LOT of information (you can't just say "give me a locale-correct uppercasing of this string" without further context). So IMO it should be third-party.

It would probably be good to have a caveat mentioning these context difficulties in the docs of the unicodedata and str/string case fiddling methods. Not a complete exposition, but making it clear that for some languages the rules require context, maybe with a hard-to-implement-correctly example of naive/incorrect use.

Do people actually read those warnings?

I have read them, I think, though not for a while.

...

Hang on, lemme pop into the time machine and add one to the docstring and docs for str.upper(). Okay, I'm back. Tell me, have you read the docstring?

...

wording that clarifies whether x.upper() uppercases the string in-place?

Chris Angelico

3:35 a.m.

New subject: Idea: Tagged strings in python

On Sat, 24 Dec 2022 at 13:15, Cameron Simpson <cs@cskk.id.au> wrote:

...

Cameron Simpson

December 2022

3:56 a.m.

New subject: Idea: Tagged strings in python

On 24Dec2022 14:35, Chris Angelico <rosuav@gmail.com> wrote:

...

On Sat, 24 Dec 2022 at 13:15, Cameron Simpson <cs@cskk.id.au> wrote: My question was more: do you know, or do you have to look? I'll take another example. Take the list.index() method, which returns the index where a thing can be found. *Without checking first*, answer these questions:

1) Will identical-but-not-equal values (eg the same instance of nan) be found?

...

2) Do the docs and/or docstring tell you the answer to question 1?

...

And then a logical followup:

3) If your answer to question 1 was incorrect, {does it help, would it have helped} to have a note in the docs?

Chris Angelico

4:12 a.m.

New subject: Idea: Tagged strings in python

On Sat, 24 Dec 2022 at 15:00, Cameron Simpson <cs@cskk.id.au> wrote:

...

On 24Dec2022 14:35, Chris Angelico <rosuav@gmail.com> wrote:

...
On Sat, 24 Dec 2022 at 13:15, Cameron Simpson <cs@cskk.id.au> wrote: My question was more: do you know, or do you have to look? I'll take another example. Take the list.index() method, which returns the index where a thing can be found. *Without checking first*, answer these questions:

1) Will identical-but-not-equal values (eg the same instance of nan) be found?

I'd say no, because it should need to do an equality check to compare things.

Well, I'm wrong. Assuming it does a precheck with "is", I guess.

Indeed; even experts don't always know this. The check used is the "identity-or-equality" check common to a lot of containment checks in Python.
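
For anyone who wants to check the identity-or-equality behaviour themselves, a small sketch (the variable names are arbitrary):

```python
nan = float("nan")
values = [1.0, nan, 2.0]

print(nan == nan)          # False: nan never compares equal, even to itself
print(values.index(nan))   # 1: found anyway, because it is the same object
print(nan in values)       # True: containment uses the same check

other = float("nan")       # a different nan object
try:
    values.index(other)
except ValueError:
    print("a different nan object is not found")
```

The same object is found by identity; a distinct nan fails both the identity and the equality test.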

...

...
2) Do the docs and/or docstring tell you the answer to question 1?

[ To the docs!... How disappointing, the "Index" href at top right does not take me directly to list.index :-) ]

...

help(list.index) seems empty.

Huh that's strange. I'm checking this in a few recent versions, and they all say "Return first index of value".

...

The best docs seem to be .index for sequences: https://docs.python.org/3/library/stdtypes.html#index-19 which only allude to the intended semantics:

index of the first occurrence of x in s (at or after index i and before index j)

I guess that _could_ imply an object identity check, but if I read it that way I might really take it to mean identity ("occurrence of x in s"), rather than the more useful and expected "also equal".

...

The "nan" test says to me that there's an identity check, which is at least quite fast and maybe significantly useful for types which intern small values like int.

...

...
And then a logical followup:

3) If your answer to question 1 was incorrect, {does it help, would it have helped} to have a note in the docs?

It would help to be able to understand the behaviour. I think with `list.index` I'd expect an equality test only (I was surprised by your "nan" example, even though "nan" is a pretty unusual value).

Cameron Simpson

5:46 a.m.

New subject: Idea: Tagged strings in python

On 24Dec2022 15:12, Chris Angelico <rosuav@gmail.com> wrote:

...

Ugh. It isn't empty. But on my local system the pager help() invokes seems to use the alternate screen, so when I exit it the help's gone. I really need to debug that, it's incredibly annoying.

...

Often.

...

Steven D'Aprano

7:30 a.m.

New subject: Idea: Tagged strings in python

On Wed, Dec 21, 2022 at 01:18:46AM -0500, David Mertz, Ph.D. wrote:

...

I'm on my tablet, so cannot test at the moment. But is `str.upper()` REALLY wrong about the Turkish dotless I (and dotted capital I) currently?!

...

...
...
    >>> "ıIiİ".lower()
    'ıiii̇'
    >>> "ıIiİ".upper()
    'IIIİ'

...

That feels like a BPO needed if true.

We do whatever the Unicode standard says to do. They say that localisation issues are out of scope for Unicode. -- Steve
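
A short sketch of what those default, locale-independent mappings do with the two Turkish letters. The escapes are used only to make the code points explicit.

```python
import unicodedata

lowered = "\u0130".lower()  # lowercase of İ (U+0130)
print(len(lowered))         # 2
print([unicodedata.name(ch) for ch in lowered])
# ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

print("\u0131".upper() == "I")  # True: dotless ı uppercases to plain I
print("i".upper() == "I")       # True: no locale is consulted
print("I".lower() == "i")       # True: likewise for lowercasing
```

So the result is consistent with the Unicode default mappings, but it is not what Turkish orthography expects for i and I.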

Chris Angelico

3:15 a.m.

New subject: Idea: Tagged strings in python

On Tue, 20 Dec 2022 at 13:56, Brendan Barnwell <brenbarn@brenbarn.net> wrote:

...

I'm of the opinion that this is a lot less special than you might think, since there are quite a lot of these sorts of special cases.

...

Maybe, but I would say that the solution is to make an easier way to make a subclass that automatically does those changes - not to make this the behaviour of all classes, everywhere. Your idea to:

...