eval and triple quoted strings
Hello all!

This surprised me:

>>> eval("'''\r\n'''")
'\n'

Where did the \r go? ast.literal_eval() has the same problem:

>>> ast.literal_eval("'''\r\n'''")
'\n'

Is this a bug/worth fixing?

Servus,
   Walter
Not a bug. The same is done for file input -- CRLF is changed to LF before tokenizing. On Jun 14, 2013 8:27 AM, "Walter Dörwald" <walter@livinglogic.de> wrote:
Hello all!
This surprised me:
>>> eval("'''\r\n'''")
'\n'
Where did the \r go? ast.literal_eval() has the same problem:
>>> ast.literal_eval("'''\r\n'''")
'\n'
Is this a bug/worth fixing?
Servus,
   Walter

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org
On 06/14/2013 10:36 AM, Guido van Rossum wrote:
Not a bug. The same is done for file input -- CRLF is changed to LF before tokenizing.
Should this be the same?

$ python3 -c 'print(bytes("""\r\n""", "utf8"))'
b'\r\n'

>>> eval('print(bytes("""\r\n""", "utf8"))')
b'\n'
Ron
On Jun 14, 2013 8:27 AM, "Walter Dörwald" <walter@livinglogic.de> wrote:
Hello all!
This surprised me:
>>> eval("'''\r\n'''")
'\n'
Where did the \r go? ast.literal_eval() has the same problem:
>>> ast.literal_eval("'''\r\n'''")
'\n'
Is this a bug/worth fixing?
Servus,
   Walter
On Fri, Jun 14, 2013 at 2:11 PM, Ron Adam <ron3200@gmail.com> wrote:
On 06/14/2013 10:36 AM, Guido van Rossum wrote:
Not a bug. The same is done for file input -- CRLF is changed to LF before tokenizing.
Should this be the same?
$ python3 -c 'print(bytes("""\r\n""", "utf8"))'
b'\r\n'

>>> eval('print(bytes("""\r\n""", "utf8"))')
b'\n'
No, but:

eval(r'print(bytes("""\r\n""", "utf8"))')

should be. (And is.) What I believe you and Walter are missing is that the \r\n in the eval strings are converted early if you don't make the enclosing string raw. So what you're eval-ing is not what you think you are eval-ing, hence the confusion.
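PJ's distinction can be checked directly; a minimal sketch comparing what string object each spelling actually hands to eval():

```python
# What eval() receives differs before it even starts parsing.
cooked = 'bytes("""\r\n""", "utf8")'    # contains a real CR LF pair
raw = r'bytes("""\r\n""", "utf8")'      # contains backslash escapes

# The cooked source holds an actual CR character; the raw source
# holds the four characters backslash, r, backslash, n instead.
assert '\r' in cooked and '\r' not in raw

# eval's tokenizer normalizes the real CRLF to LF, while the
# escape sequences survive to eval's own string parsing:
assert eval(cooked) == b'\n'
assert eval(raw) == b'\r\n'
```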
On 06/14/2013 04:03 PM, PJ Eby wrote:
Should this be the same?
python3 -c 'print(bytes("""\r\n""", "utf8"))'
b'\r\n'

>>> eval('print(bytes("""\r\n""", "utf8"))')
b'\n'

No, but:
eval(r'print(bytes("""\r\n""", "utf8"))')
should be. (And is.)
What I believe you and Walter are missing is that the \r\n in the eval strings are converted early if you don't make the enclosing string raw. So what you're eval-ing is not what you think you are eval-ing, hence the confusion.
Yes thanks, seems like an easy mistake to make.

To be clear...

The string to eval is parsed when the eval line is tokenized in the scope containing the eval() function. The eval function then parses the resulting string object it receives as its input.

There is no mention of using raw strings in the docs on eval and exec. I think there should be, because the intention (in most cases) is for eval to parse the string, and not for it to be parsed or changed before it's evaluated by eval or exec.

An example using a string with escape characters might make it clearer.

Cheers,
   Ron
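A sketch of the kind of doc example Ron is asking for (illustrative only, not proposed documentation text):

```python
# Three spellings of "a literal containing \r\n" handed to eval(),
# and what each actually produces.

# Plain literal: the escapes become a real CR and LF before eval
# runs, and eval's tokenizer then normalizes the CRLF to '\n'.
assert eval("'''\r\n'''") == '\n'

# Raw literal: the backslash sequences reach eval intact, so
# eval's own string parsing produces the CR and LF characters.
assert eval(r"'''\r\n'''") == '\r\n'

# Doubled backslashes: equivalent to the raw form.
assert eval("'''\\r\\n'''") == '\r\n'
```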
The semantics of raw strings are clear. I don't see that they should be called out especially in any context. (Except for regexps.) Usually exec() is not used with a literal anyway (what would be the point).

--Guido van Rossum (sent from Android phone)

On Jun 15, 2013 1:03 PM, "Ron Adam" <ron3200@gmail.com> wrote:
On 06/14/2013 04:03 PM, PJ Eby wrote:
Should this be the same?
python3 -c 'print(bytes("""\r\n""", "utf8"))'
b'\r\n'

>>> eval('print(bytes("""\r\n""", "utf8"))')
b'\n'
No, but:
eval(r'print(bytes("""\r\n""", "utf8"))')
should be. (And is.)
What I believe you and Walter are missing is that the \r\n in the eval strings are converted early if you don't make the enclosing string raw. So what you're eval-ing is not what you think you are eval-ing, hence the confusion.
Yes thanks, seems like an easy mistake to make.
To be clear...
The string to eval is parsed when the eval line is tokenized in the scope containing the eval() function. The eval function then parses the resulting string object it receives as its input.
There is no mention of using raw strings in the docs on eval and exec. I think there should be, because the intention (in most cases) is for eval to parse the string, and not for it to be parsed or changed before it's evaluated by eval or exec.
An example using a string with escape characters might make it clearer.
Cheers,
   Ron
On 06/15/2013 03:23 PM, Guido van Rossum wrote:
The semantics of raw strings are clear. I don't see that they should be called out especially in any context. (Except for regexps.) Usually exec() is not used with a literal anyway (what would be the point).
There are about a hundred instances of eval/exec(some_string_literal) in Python's library. Most of them are in the tests, and maybe about half of those test the compiler, eval, and exec.

$ egrep -owr --include="*.py" "(eval|exec)\(('.*'|\".*\")\)" * | wc -l
114

I have no idea in how many places a string literal is assigned to a name first and then used later in eval or exec. It's harder to grep for but would be less than...

$ egrep -owr --include="*.py" "(eval|exec)\(.*\)" * | wc -l
438

That's overstated because some of those are comments, and some may be functions with names ending in eval or exec.

I do think that eval and exec are a similar case to regexps. And possibly often enough, the string may contain a raw string, regular expression, or a file/path name.

Only a short note is needed in the docs for eval, nothing more. And not even that if no one thinks it's an issue.

Cheers,
   Ron
On 14.06.13 23:03, PJ Eby wrote:
On Fri, Jun 14, 2013 at 2:11 PM, Ron Adam <ron3200@gmail.com> wrote:
On 06/14/2013 10:36 AM, Guido van Rossum wrote:
Not a bug. The same is done for file input -- CRLF is changed to LF before tokenizing.
Should this be the same?
python3 -c 'print(bytes("""\r\n""", "utf8"))'
b'\r\n'

>>> eval('print(bytes("""\r\n""", "utf8"))')
b'\n'
No, but:
eval(r'print(bytes("""\r\n""", "utf8"))')
should be. (And is.)
What I believe you and Walter are missing is that the \r\n in the eval strings are converted early if you don't make the enclosing string raw. So what you're eval-ing is not what you think you are eval-ing, hence the confusion.
I expected that eval()ing a string that contains the characters

U+0027: APOSTROPHE
U+0027: APOSTROPHE
U+0027: APOSTROPHE
U+000D: CR
U+000A: LF
U+0027: APOSTROPHE
U+0027: APOSTROPHE
U+0027: APOSTROPHE

to return a string containing the characters:

U+000D: CR
U+000A: LF

Making the string raw, of course, turns it into:

U+0027: APOSTROPHE
U+0027: APOSTROPHE
U+0027: APOSTROPHE
U+005C: REVERSE SOLIDUS
U+0072: LATIN SMALL LETTER R
U+005C: REVERSE SOLIDUS
U+006E: LATIN SMALL LETTER N
U+0027: APOSTROPHE
U+0027: APOSTROPHE
U+0027: APOSTROPHE

and eval()ing that does indeed give "\r\n" as expected.

Hmm, it seems that codecs.unicode_escape_decode() does what I want:
>>> codecs.unicode_escape_decode("\r\n\\r\\n\\x0d\\x0a\\u000d\\u000a")
('\r\n\r\n\r\n\r\n', 26)
Servus, Walter
On 17.06.13 19:04, Walter Dörwald wrote:
Hmm, it seems that codecs.unicode_escape_decode() does what I want:
>>> codecs.unicode_escape_decode("\r\n\\r\\n\\x0d\\x0a\\u000d\\u000a")
('\r\n\r\n\r\n\r\n', 26)
Hmm, no it doesn't:
>>> codecs.unicode_escape_decode("\u1234")
('á\x88´', 3)
Servus, Walter
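A common workaround for the limitation Walter hit is to round-trip through latin-1 first; a sketch, with the caveat that input containing non-Latin-1 characters still fails, just loudly at the encode step instead of being silently garbled:

```python
import codecs

# 'unicode_escape' decodes escape sequences, but when given a str
# it effectively treats the text as latin-1 bytes first, which is
# why "\u1234" came out mangled above: its UTF-8 byte sequence
# E1 88 B4 was decoded byte-by-byte as latin-1.
s = "\\r\\n\\x0d\\x0a\\u000d\\u000a"
decoded = codecs.decode(s.encode('latin-1'), 'unicode_escape')
assert decoded == '\r\n\r\n\r\n'

# Input that already holds non-latin-1 characters now raises
# instead of producing mojibake:
try:
    codecs.decode("\u1234".encode('latin-1'), 'unicode_escape')
except UnicodeEncodeError:
    print("U+1234 cannot survive the latin-1 round trip")
```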
On Mon, Jun 17, 2013 at 10:04 AM, Walter Dörwald <walter@livinglogic.de> wrote:
I expected that eval()ing a string that contains the characters
U+0027: APOSTROPHE
U+0027: APOSTROPHE
U+0027: APOSTROPHE
U+000D: CR
U+000A: LF
U+0027: APOSTROPHE
U+0027: APOSTROPHE
U+0027: APOSTROPHE
to return a string containing the characters:
U+000D: CR
U+000A: LF
No. Executing a file containing those exact characters produces a string containing only '\n' and exec/eval is meant to behave the same way. The string may not have originated from a file, so the universal newlines behavior of the io module is irrelevant here -- the parser must implement its own equivalent processing, and it does. -- --Guido van Rossum (python.org/~guido)
Guido van Rossum wrote:
No. Executing a file containing those exact characters produces a string containing only '\n' and exec/eval is meant to behave the same way. The string may not have originated from a file, so the universal newlines behavior of the io module is irrelevant here -- the parser must implement its own equivalent processing, and it does.
I'm still not convinced that this is necessary or desirable behaviour. I can understand the parser doing this as a workaround before we had universal newlines, but now that we do, I'd expect any Python string to already have newlines converted to their canonical representation, and that any CRs it contains are meant to be there. The parser shouldn't need to do newline translation a second time. -- Greg
On Mon, Jun 17, 2013 at 3:18 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Guido van Rossum wrote:
No. Executing a file containing those exact characters produces a string containing only '\n' and exec/eval is meant to behave the same way. The string may not have originated from a file, so the universal newlines behavior of the io module is irrelevant here -- the parser must implement its own equivalent processing, and it does.
I'm still not convinced that this is necessary or desirable behaviour. I can understand the parser doing this as a workaround before we had universal newlines, but now that we do, I'd expect any Python string to already have newlines converted to their canonical representation, and that any CRs it contains are meant to be there. The parser shouldn't need to do newline translation a second time.
There are other ways to get a string besides reading it from a file. Anyway, I think that if you want a string literal that contains \r\n as its line endings, you should use a syntactic solution, and the syntax ought to be the same regardless of whether you are reading it from a file or from a string literal. That syntactic solution is very clear:

"""line one\r
line two\r
line three\r
"""

This works everywhere.

--
--Guido van Rossum (python.org/~guido)
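The "works everywhere" claim can be illustrated like this (a sketch, not from the thread): the escape-based spelling produces the same result whether the source goes through eval() directly or through compile() as file-style input:

```python
# The \r escape spelling of a CRLF line ending survives both
# routes into the compiler, unlike a literal CRLF pair.
src = '"""line one\\r\nline two\\r\n"""'

# Route 1: eval the source string directly.
from_eval = eval(src)

# Route 2: compile it explicitly, as file input would be.
from_compile = eval(compile(src, '<string>', 'eval'))

assert from_eval == from_compile == 'line one\r\nline two\r\n'
```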
2013/6/17 Greg Ewing <greg.ewing@canterbury.ac.nz>:
Guido van Rossum wrote:
No. Executing a file containing those exact characters produces a string containing only '\n' and exec/eval is meant to behave the same way. The string may not have originated from a file, so the universal newlines behavior of the io module is irrelevant here -- the parser must implement its own equivalent processing, and it does.
I'm still not convinced that this is necessary or desirable behaviour. I can understand the parser doing this as a workaround before we had universal newlines, but now that we do, I'd expect any Python string to already have newlines converted to their canonical representation, and that any CRs it contains are meant to be there. The parser shouldn't need to do newline translation a second time.
It used to be that way until 2.7. People like to do things like

with open("myfile.py", "rb") as fp:
    exec fp.read() in ns

which used to fail with CRLF newlines because binary mode doesn't have them. I think this is actually the correct way to execute Python sources, because the parser then handles the somewhat complicated process of decoding Python source for you.

--
Regards,
Benjamin
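The Python 3 equivalent of Benjamin's pattern still works, since exec() accepts bytes and runs them through the parser's own source decoding; a small sketch with hypothetical file contents:

```python
# exec() of raw bytes lets the parser handle newline
# normalization (and encoding detection) itself, just as it
# would for real file input.
source = b"msg = '''\r\n'''\r\nn = 1\r\n"

ns = {}
exec(source, ns)

# The CRLF inside the triple-quoted string was normalized by
# the tokenizer, exactly as the thread describes.
assert ns['msg'] == '\n'
assert ns['n'] == 1
```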
On Mon, Jun 17, 2013 at 4:40 PM, Benjamin Peterson <benjamin@python.org> wrote:
2013/6/17 Greg Ewing <greg.ewing@canterbury.ac.nz>:
Guido van Rossum wrote:
No. Executing a file containing those exact characters produces a string containing only '\n' and exec/eval is meant to behave the same way. The string may not have originated from a file, so the universal newlines behavior of the io module is irrelevant here -- the parser must implement its own equivalent processing, and it does.
I'm still not convinced that this is necessary or desirable behaviour. I can understand the parser doing this as a workaround before we had universal newlines, but now that we do, I'd expect any Python string to already have newlines converted to their canonical representation, and that any CRs it contains are meant to be there. The parser shouldn't need to do newline translation a second time.
It used to be that way until 2.7. People like to do things like
with open("myfile.py", "rb") as fp:
    exec fp.read() in ns
which used to fail with CRLF newlines because binary mode doesn't have them. I think this is actually the correct way to execute Python sources because the parser then handles the somewhat complicated process of decoding Python source for you.
What exactly does the parser handle better than the io module? Is it just the coding cookies? I suppose that works as long as the file is encoded using an ASCII superset like the Latin-N variants or UTF-8. It would fail pretty badly if it was UTF-16 (and yes, that's an abominable encoding for other reasons :-).

--
--Guido van Rossum (python.org/~guido)
2013/6/17 Guido van Rossum <guido@python.org>:
On Mon, Jun 17, 2013 at 4:40 PM, Benjamin Peterson <benjamin@python.org> wrote:
2013/6/17 Greg Ewing <greg.ewing@canterbury.ac.nz>:
Guido van Rossum wrote:
No. Executing a file containing those exact characters produces a string containing only '\n' and exec/eval is meant to behave the same way. The string may not have originated from a file, so the universal newlines behavior of the io module is irrelevant here -- the parser must implement its own equivalent processing, and it does.
I'm still not convinced that this is necessary or desirable behaviour. I can understand the parser doing this as a workaround before we had universal newlines, but now that we do, I'd expect any Python string to already have newlines converted to their canonical representation, and that any CRs it contains are meant to be there. The parser shouldn't need to do newline translation a second time.
It used to be that way until 2.7. People like to do things like
with open("myfile.py", "rb") as fp:
    exec fp.read() in ns
which used to fail with CRLF newlines because binary mode doesn't have them. I think this is actually the correct way to execute Python sources because the parser then handles the somewhat complicated process of decoding Python source for you.
What exactly does the parser handle better than the io module? Is it just the coding cookies? I suppose that works as long as the file is encoded using an ASCII superset like the Latin-N variants or UTF-8. It would fail pretty badly if it was UTF-16 (and yes, that's an abominable encoding for other reasons :-).
The coding cookie is the main one. In fact, if you can't parse that, you don't really know what encoding to open the file with at all. There are also small things like BOM handling (you have to use the utf-16-sig encoding with TextIO to get it removed) and defaulting to UTF-8 (which the io module doesn't do) which are better left to the parser.

--
Regards,
Benjamin
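The detection logic Benjamin describes is exposed as tokenize.detect_encoding(); a minimal sketch of the three cases he mentions:

```python
import io
import tokenize

# A coding cookie is found and its name normalized.
buf = io.BytesIO(b"# -*- coding: latin-1 -*-\nx = 1\n")
encoding, lines = tokenize.detect_encoding(buf.readline)
assert encoding == 'iso-8859-1'

# A UTF-8 BOM is recognized and mapped to utf-8-sig, so the
# BOM gets stripped when the file is decoded.
buf = io.BytesIO(b'\xef\xbb\xbfx = 1\n')
encoding, lines = tokenize.detect_encoding(buf.readline)
assert encoding == 'utf-8-sig'

# No cookie, no BOM: the parser's UTF-8 default applies.
buf = io.BytesIO(b"x = 1\n")
encoding, lines = tokenize.detect_encoding(buf.readline)
assert encoding == 'utf-8'
```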
On Mon, Jun 17, 2013 at 5:02 PM, Benjamin Peterson <benjamin@python.org> wrote:
2013/6/17 Guido van Rossum <guido@python.org>:
On Mon, Jun 17, 2013 at 4:40 PM, Benjamin Peterson <benjamin@python.org> wrote:
2013/6/17 Greg Ewing <greg.ewing@canterbury.ac.nz>:
Guido van Rossum wrote:
No. Executing a file containing those exact characters produces a string containing only '\n' and exec/eval is meant to behave the same way. The string may not have originated from a file, so the universal newlines behavior of the io module is irrelevant here -- the parser must implement its own equivalent processing, and it does.
I'm still not convinced that this is necessary or desirable behaviour. I can understand the parser doing this as a workaround before we had universal newlines, but now that we do, I'd expect any Python string to already have newlines converted to their canonical representation, and that any CRs it contains are meant to be there. The parser shouldn't need to do newline translation a second time.
It used to be that way until 2.7. People like to do things like
with open("myfile.py", "rb") as fp:
    exec fp.read() in ns
which used to fail with CRLF newlines because binary mode doesn't have them. I think this is actually the correct way to execute Python sources because the parser then handles the somewhat complicated process of decoding Python source for you.
What exactly does the parser handle better than the io module? Is it just the coding cookies? I suppose that works as long as the file is encoded using an ASCII superset like the Latin-N variants or UTF-8. It would fail pretty badly if it was UTF-16 (and yes, that's an abominable encoding for other reasons :-).
The coding cookie is the main one. In fact, if you can't parse that, you don't really know what encoding to open the file with at all. There are also small things like BOM handling (you have to use the utf-16-sig encoding with TextIO to get it removed) and defaulting to UTF-8 (which the io module doesn't do) which are better left to the parser.
Maybe there are some lessons here that the TextIO module could learn? -- --Guido van Rossum (python.org/~guido)
It may be possible to implement parsing the codec cookie as a Python codec :-)

Victor

2013/6/18 Guido van Rossum <guido@python.org>:
On Mon, Jun 17, 2013 at 5:02 PM, Benjamin Peterson <benjamin@python.org> wrote:
2013/6/17 Guido van Rossum <guido@python.org>:
On Mon, Jun 17, 2013 at 4:40 PM, Benjamin Peterson <benjamin@python.org> wrote:
2013/6/17 Greg Ewing <greg.ewing@canterbury.ac.nz>:
Guido van Rossum wrote:
No. Executing a file containing those exact characters produces a string containing only '\n' and exec/eval is meant to behave the same way. The string may not have originated from a file, so the universal newlines behavior of the io module is irrelevant here -- the parser must implement its own equivalent processing, and it does.
I'm still not convinced that this is necessary or desirable behaviour. I can understand the parser doing this as a workaround before we had universal newlines, but now that we do, I'd expect any Python string to already have newlines converted to their canonical representation, and that any CRs it contains are meant to be there. The parser shouldn't need to do newline translation a second time.
It used to be that way until 2.7. People like to do things like
with open("myfile.py", "rb") as fp:
    exec fp.read() in ns
which used to fail with CRLF newlines because binary mode doesn't have them. I think this is actually the correct way to execute Python sources because the parser then handles the somewhat complicated process of decoding Python source for you.
What exactly does the parser handle better than the io module? Is it just the coding cookies? I suppose that works as long as the file is encoded using an ASCII superset like the Latin-N variants or UTF-8. It would fail pretty badly if it was UTF-16 (and yes, that's an abominable encoding for other reasons :-).
The coding cookie is the main one. In fact, if you can't parse that, you don't really know what encoding to open the file with at all. There are also small things like BOM handling (you have to use the utf-16-sig encoding with TextIO to get it removed) and defaulting to UTF-8 (which the io module doesn't do) which are better left to the parser.
Maybe there are some lessons here that the TextIO module could learn?
--
--Guido van Rossum (python.org/~guido)
2013/6/17 Guido van Rossum <guido@python.org>:
On Mon, Jun 17, 2013 at 5:02 PM, Benjamin Peterson <benjamin@python.org> wrote:
2013/6/17 Guido van Rossum <guido@python.org>:
On Mon, Jun 17, 2013 at 4:40 PM, Benjamin Peterson <benjamin@python.org> wrote:

What exactly does the parser handle better than the io module? Is it just the coding cookies? I suppose that works as long as the file is encoded using an ASCII superset like the Latin-N variants or UTF-8. It would fail pretty badly if it was UTF-16 (and yes, that's an abominable encoding for other reasons :-).
The coding cookie is the main one. In fact, if you can't parse that, you don't really know what encoding to open the file with at all. There are also small things like BOM handling (you have to use the utf-16-sig encoding with TextIO to get it removed) and defaulting to UTF-8 (which the io module doesn't do) which are better left to the parser.
Maybe there are some lessons here that the TextIO module could learn?
UTF-8 by default would be great, but that ship has sailed. Reading Python coding cookies is outside the purview of TextIOWrapper. However, it would be good to have a function in the stdlib to read a python source file to Unicode; I've definitely implemented that several times. -- Regards, Benjamin
Le 17/06/2013 20:49, Benjamin Peterson a écrit :
Reading Python coding cookies is outside the purview of TextIOWrapper. However, it would be good to have a function in the stdlib to read a python source file to Unicode; I've definitely implemented that several times.
IIUC you want http://docs.python.org/3/library/tokenize#tokenize.open (3.2+). Regards
2013/6/17 Éric Araujo <merwok@netwok.org>:
Le 17/06/2013 20:49, Benjamin Peterson a écrit :
Reading Python coding cookies is outside the purview of TextIOWrapper. However, it would be good to have a function in the stdlib to read a python source file to Unicode; I've definitely implemented that several times.
IIUC you want http://docs.python.org/3/library/tokenize#tokenize.open (3.2+).
Yep. :) -- Regards, Benjamin
On 06/17/2013 05:18 PM, Greg Ewing wrote:
I'm still not convinced that this is necessary or desirable behaviour. I can understand the parser doing this as a workaround before we had universal newlines, but now that we do, I'd expect any Python string to already have newlines converted to their canonical representation, and that any CRs it contains are meant to be there. The parser shouldn't need to do newline translation a second time.
It's the other way around. Eval and exec should generate the same results as Python's compiler with the same input, including errors and exceptions. The only way we can have that is if eval and exec parse everything the same way.

It's the first parsing that needs to be avoided or compensated for in these cases. Raw strings (my preference) work for string literals, or you can escape the escape codes so they are still individual characters after the first translation. Or read the code directly from a file rather than importing it.

For example, if you wrote your own Python console program, you would want all the errors and exceptions to come from eval, including those for bad strings. You would still need to feed the bad strings to eval. If you don't, then you won't get the same output from eval as the compiler does.

Cheers,
   Ron
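Ron's point, that eval/exec and the compiler should fail the same way on the same input, can be sketched like this (illustrative only):

```python
# A bad string literal should raise a SyntaxError whether it is
# compiled as file-style input or fed to eval().
bad = "'unterminated"

try:
    compile(bad, '<file>', 'exec')
except SyntaxError as e:
    compile_error = type(e)

try:
    eval(bad)
except SyntaxError as e:
    eval_error = type(e)

# Both routes report the problem through the same exception type.
assert compile_error is eval_error is SyntaxError
```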
On 06/17/2013 12:04 PM, Walter Dörwald wrote:
Making the string raw, of course turns it into:
U+0027: APOSTROPHE U+0027: APOSTROPHE U+0027: APOSTROPHE U+005C: REVERSE SOLIDUS U+0072: LATIN SMALL LETTER R U+005C: REVERSE SOLIDUS U+006E: LATIN SMALL LETTER N U+0027: APOSTROPHE U+0027: APOSTROPHE U+0027: APOSTROPHE
and eval()ing that does indeed give "\r\n" as expected.
You can also escape the backslashes in a regular string to get the same result.
>>> s1 = "'''\\r\\n'''"
>>> list(s1)
["'", "'", "'", '\\', 'r', '\\', 'n', "'", "'", "'"]

>>> s2 = eval(s1)
>>> list(s2)
['\r', '\n']

>>> s3 = "'''%s'''" % s2
>>> list(s3)
["'", "'", "'", '\r', '\n', "'", "'", "'"]

>>> s4 = eval(s3)
>>> list(s4)
['\n']
When a standard string literal is used with eval, it's evaluated first to a string object in the same scope the eval function is called from; then the eval function is called with that string object and it's evaluated again. So it's really being parsed twice. (That's the part that got me.)

The transformation between s1 and s2 is what Phillip is referring to, and Guido is referring to the transformation from s2 to s4. (s3 is needed to avoid the end-of-line error of evaluating a single-quoted string with \n in it.)

When a string literal is used directly with eval, it looks like it is evaluated from s1 to s4 in one step, but that isn't what is happening.

Cheers,
   Ron

(ps: Was informed my posts were showing up twice.. hopefully I got that fixed now.)
Guido van Rossum wrote:
Not a bug. The same is done for file input -- CRLF is changed to LF before tokenizing.
I'm not convinced it's reasonable behaviour to re-scan the string as though it's being read from a file. It's a Python string, so it's already been through whatever line-ending transformation is appropriate to get it into memory. -- Greg
On 15 June 2013 14:08, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Guido van Rossum wrote:
Not a bug. The same is done for file input -- CRLF is changed to LF before tokenizing.
I'm not convinced it's reasonable behaviour to re-scan the string as though it's being read from a file. It's a Python string, so it's already been through whatever line-ending transformation is appropriate to get it into memory.
No, that's not the way the Python compiler works. The transformation Guido is talking about is the way the tokenizer identifies "NEWLINE" tokens:

>>> list(tokenize.tokenize((l for l in (b"""'\r\n'""", b"")).__next__))[2]
TokenInfo(type=4 (NEWLINE), string='\r\n', start=(1, 1), end=(1, 3), line="'\r\n'")

This long predates universal newlines mode - it's part of the compilation process, not part of the IO system. The compiler then sees the NEWLINE token in the tokenizer output, and inserts a "\n" into the triple-quoted string.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
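Nick's description can be checked with a simpler source line: the NEWLINE token carries the raw '\r\n' text, while compilation treats it as an ordinary logical newline (a small sketch):

```python
import io
import tokenize

# Tokenize a one-line source that ends with CRLF.
src = b"x = 1\r\n"
tokens = list(tokenize.tokenize(io.BytesIO(src).readline))

# The NEWLINE token still carries the raw '\r\n' text...
newline = next(t for t in tokens if t.type == tokenize.NEWLINE)
assert newline.string == '\r\n'

# ...but compiling the same bytes behaves exactly as if the
# line ended with a bare '\n'.
ns = {}
exec(src, ns)
assert ns['x'] == 1
```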
participants (9)

- Benjamin Peterson
- Greg Ewing
- Guido van Rossum
- Nick Coghlan
- PJ Eby
- Ron Adam
- Victor Stinner
- Walter Dörwald
- Éric Araujo