Make undefined escape sequences have SyntaxWarnings

The literal "\c" should be an error but in practice means "\\c". It's probably too late to make this invalid syntax as it ought to be, but I wonder if a warning isn't in order, especially with the theoretical potential of adding new string escapes in the future.
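For concreteness, a minimal sketch of the behaviour under discussion. The literal is compiled from a source string with eval so that the example can silence any warning the interpreter may choose to emit for it; the variable names are illustrative only.

```python
import warnings

# Source text of a one-token program: quote, backslash, c, quote.
source = r'"\c"'

with warnings.catch_warnings():
    # Some interpreter versions warn about this literal; silence that here
    # so the sketch only shows the resulting value.
    warnings.simplefilter("ignore")
    value = eval(source)

# The unknown escape is kept as two characters: a backslash and "c".
assert value == "\\c"
assert len(value) == 2
```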

On Wed, 10 Oct 2012 15:36:08 -0400 Mike Graham <mikegraham@gmail.com> wrote:
-1. This will make life more difficult with regular expressions (and produce lots of spurious warnings in existing code). Regards Antoine. -- Software development and contracting: http://pro.pitrou.net

On 10.10.12 22:46, Antoine Pitrou wrote:
-1. This will make life more difficult with regular expressions (and produce lots of spurious warnings in existing code).
Strings for regular expressions should always be raw. Regular expressions now support \u and \U escapes, so there is no reason to use non-raw strings.
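A small sketch of what this looks like, assuming an interpreter whose re module accepts \u in patterns (3.3 or later). The pattern and sample text are made up for illustration.

```python
import re

# Raw string: the backslash reaches the regex engine untouched, and the
# engine (not the string literal) interprets \u00e9.
pattern = r"caf\u00e9"
assert len(pattern) == 9  # c, a, f, backslash, u, 0, 0, e, 9

# The subject string is a cooked literal, so it really contains U+00E9.
match = re.fullmatch(pattern, "caf\u00e9")
assert match is not None
```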

On Wed, 10 Oct 2012 23:04:25 +0300 Serhiy Storchaka <storchaka@gmail.com> wrote:
That's a style issue, not a language rule. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net

On 10.10.12 23:18, Antoine Pitrou wrote:
Yes, of course, that's style advice. Sorry if I used the wrong words. This will not make life more difficult with regular expressions because you can always use raw string literals.

http://docs.python.org/release/3.3.0/reference/lexical_analysis.html#string-...

I'm not sure I understand what this line from the docs means:

    \newline    Backslash and newline ignored

I understand that row as either "\n" won't appear in the resulting string or that I should get "\\newline".

Yuval Greenfield
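As far as I can tell, that row refers to a backslash immediately followed by a real line break in the source, not to the two characters backslash and "n". A minimal sketch (the variable name is illustrative):

```python
# The backslash at the end of the first line and the line break after it
# are both dropped, so the literal continues on the next source line.
text = "abc\
def"

assert text == "abcdef"
assert "\n" not in text
```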

On 11/10/12 07:04, Serhiy Storchaka wrote:
Why? The re module doesn't care how you construct the strings. It *can't* care how you construct the strings. Something like re.search('\D*', 'abcd1234xyz') works perfectly well and there is no need for a raw string. Any requirement to "always use raw strings" is a style issue, not a language issue. -- Steven
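To illustrate the point that re only ever sees the finished string, here is a sketch that spells the same pattern two ways that do not rely on the unknown-escape rule. Both produce the identical three-character string, so re cannot tell them apart.

```python
import re

raw = r"\D*"       # raw literal: backslash, D, asterisk
doubled = "\\D*"   # cooked literal with the backslash doubled

# Same string object contents either way.
assert raw == doubled
assert len(raw) == 3

# \D* greedily matches the leading run of non-digits.
found = re.search(raw, "abcd1234xyz").group()
assert found == "abcd"
```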

On Wed, Oct 10, 2012 at 3:46 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Regular expressions are difficult if you're remembering which escape sequences exist and are easy if you're using raw string literals. Mike

On Wed, 10 Oct 2012 16:08:22 -0400 Mike Graham <mikegraham@gmail.com> wrote:
That's a misconception, since as the re docs mention: “Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser: [snip]” http://docs.python.org/dev/library/re.html In other words, whether you put "\t" or "\\t" in a regexp doesn't matter: it means the same to the regexp engine. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
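A quick sketch of that claim: the two patterns are different strings, but both match a single tab.

```python
import re

tab_char = "\t"     # one character: an actual tab
tab_escape = "\\t"  # two characters: backslash, t

assert len(tab_char) == 1
assert len(tab_escape) == 2

# The regex engine treats a literal tab and its own \t escape alike.
assert re.fullmatch(tab_char, "\t") is not None
assert re.fullmatch(tab_escape, "\t") is not None
```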

On 11/10/12 07:08, Mike Graham wrote:
The literal "\c" should be an error
Who says so? My bash shell disagrees with you:

    [steve@ando ~]$ touch spam
    [steve@ando ~]$ ls s\pa\m
    spam

and so do I. There are three obvious behaviours for extraneous escapes:

1) backslash-c resolves to just c (what bash and VisualStudio do)
2) backslash-c resolves to backslash-c (what Python does)
3) raise an exception or compile-time error (what Java does)

It is undefined behaviour in C. It is a matter of opinion that Java got it right and the others got it wrong, one which I do not share.
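The three behaviours can be written down as a toy function. This is only a sketch: resolve_unknown_escape and its policy names are hypothetical and do not correspond to anything in Python, bash or Java.

```python
def resolve_unknown_escape(char, policy):
    """Return what backslash + char would become under a given policy.

    Hypothetical helper for illustration; `char` is assumed to be a
    character with no defined escape meaning.
    """
    if policy == "drop":    # 1) backslash-c resolves to just c
        return char
    if policy == "keep":    # 2) backslash-c resolves to backslash-c
        return "\\" + char
    if policy == "error":   # 3) reject the sequence outright
        raise ValueError("unknown escape sequence: \\" + char)
    raise ValueError("unknown policy: " + policy)


assert resolve_unknown_escape("c", "drop") == "c"
assert resolve_unknown_escape("c", "keep") == "\\c"
```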
I agree with Antoine here. If and when there is a serious, concrete proposal to add a new string escape, and not just a "theoretical potential", then we should consider adding warnings.
Regular expressions are difficult if you're remembering which escape sequences exist and are easy if you're using raw string literals.
Just because some people find it hard to remember doesn't mean that it should be an error *not* to use raw strings. -- Steven

On Wed, Oct 10, 2012 at 8:08 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Frankly, I don't look to bash for sensible language design advice. I think concepts like "In the face of ambiguity, refuse the temptation to guess" guide how we should see the decision here. "Backslash is for escape sequences except when it's not" seemed like an obviously-misfortunate thing to me. I'm truly perplexed people see it as a feature they're eager to use, but I guess I should learn something from that.
I didn't say that it should be an error not to use raw strings. I was saying that the implication that this suggestion makes constructing regex strings hard is silly, and mentioning the thing that makes them easy. I'm not suggesting that you shouldn't be able to use normal string literals. Antoine went on to point out that things like "\t" worked in regex strings. This is an unrelated feature that I never suggested altering. In that case, a tab character in your string is regarded like \t. This behavior would remain. I think four string escapes have been added since versions of Python I was aware of. Writing code like "ab\c" seems seedy in light of that. Mike

On 11/10/12 13:24, Mike Graham wrote:
Pity, because in this case I think bash is actually more sensible than either Python or Java. If you escape a character, you should get something. If it's a special character, you get the special meaning. If it's not, escaping should be transparent: escaping something that doesn't need escaping is a null op:

    py> from urllib import quote_plus
    py> quote_plus('abc')
    'abc'

If we were designing Python from scratch, I'd prefer '\D' -> 'D'. But we're not, so I'm happy with the current behaviour, and don't agree that it should be an error or that it needs warning about.
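The session above uses the Python 2 import path. A sketch of the same "null op" idea on Python 3, where the function lives in urllib.parse (the second input string is made up for contrast):

```python
from urllib.parse import quote_plus

# Nothing in "abc" needs quoting, so quoting is transparent.
plain = quote_plus("abc")
assert plain == "abc"

# Characters that do need quoting are transformed.
special = quote_plus("a b&c")
assert special == "a+b%26c"
```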
Where is the ambiguity? Is there ever a context where \D could mean two different things and it isn't clear which one? "In the face of ambiguity..." does not mean "refuse to decide on language behaviour". Everything is ambiguous until you decide what something will mean. It's only when you have two possible meanings and no clear, obvious way to determine which one applies that the ambiguity koan applies.
No. In cooked strings, backslash-C is always an escape sequence, for any character (or hex/oct code) C. But some escape sequences resolve to a single char (\n -> newline) and some resolve to a pair of chars (\D -> backslash D). In Haskell, \& resolves to the empty string. It's still an escape sequence. [...]
I think four string escapes have been added since versions of Python I was aware of. Writing code like "ab\c" seems seedy in light of that
Adding a new escape sequence is almost as big a step as adding a new built-in or new syntax. I see that as a good thing, it discourages too many requests for new escape sequences. -- Steven

Steven D'Aprano wrote:
I think that calling "\n", "\t" etc. "escape sequences" is a misnomer that is causing confusion in this discussion. The term "escape" in this context means to prevent something from having a special meaning that it would otherwise have. But the backslash in these is being used to *give* a special meaning to the following character.

In Python string literals, the only true escape sequences associated with the backslash are '\\', "\'" and '\"'. So the backslash is a bit schizophrenic -- sometimes it's an escape character, sometimes it's a prefix that imparts a special meaning.

This means that "\c" where c is not special in any way is somewhat ambiguous. Are you redundantly escaping something that doesn't need it, are you asking for a special meaning that doesn't exist (which is probably a mistake), or do you just want a literal backslash?

Python guesses that you want a literal backslash. This seems to be motivated by the desire to minimise the need for backslash doubling. That sounds fine in theory, but I don't think it helps much in practice. I for one don't trust myself to keep the entire set of special characters in my head, including all the rarely-used ones, so I end up doubling every backslash anyway.

Given that, I wouldn't have minded at all if Python had refused to guess in this case, and raised a compile-time error. That would have left the way open for extending the set of special chars in the future.
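A short sketch of the two roles described above, plus the habit of doubling every backslash. Variable names are illustrative only.

```python
# Role 1: the backslash removes a special meaning (a true escape).
backslash = "\\"     # a single backslash character
single_quote = '\''  # a quote that does not end the literal
double_quote = "\""
assert [len(s) for s in (backslash, single_quote, double_quote)] == [1, 1, 1]

# Role 2: the backslash gives the next character a special meaning.
newline = "\n"
assert len(newline) == 1 and newline != "n"

# Doubling every backslash never depends on remembering which letters
# are special; it yields the same string as a raw literal.
assert "\\d+\\n" == r"\d+\n"
```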
I don't see it makes much difference. We get plenty of requests for new syntax of all kinds, and we seem to have enough sense to reject them unless they're backed by extremely good arguments. There's no reason requests for new special chars should be treated any differently. -- Greg

On 2012-10-10 20:46, Antoine Pitrou wrote:
How would it make life more difficult with regular expressions? I would've preferred:

1. Unknown escapes in string literals give a compile-time error
2. Raw string literals treat backslashes as pure literals
3. Unknown escapes in regex patterns give a run-time error

Unfortunately, changing them would break existing code. (I retain the behaviour of re in the regex module for this reason, not that I like it. :-() It would've been nice if the 'fix' had been made in Python 3...
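A sketch relating to points 2 and 3, with the caveat that it reflects interpreters newer than this thread: as far as I know, re has rejected unknown escapes made of a backslash and an ASCII letter since Python 3.6. The outcome variable is illustrative.

```python
import re

# A raw literal keeps the backslash as an ordinary character.
assert len(r"\c") == 2

# Unknown letter escape in a regex pattern: checked when the pattern
# is compiled, i.e. at run time.
try:
    re.compile(r"\c")
except re.error:
    outcome = "run-time error"
else:
    outcome = "accepted"  # what older interpreters are expected to do
```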

participants (7): Antoine Pitrou, Greg Ewing, Mike Graham, MRAB, Serhiy Storchaka, Steven D'Aprano, Yuval Greenfield