Mailman 3 Proposal: from __future__ import unicode_string_literals - Python-Dev

Proposal: from __future__ import unicode_string_literals

Eric Smith

21 Mar 2008

12:55 a.m.

Following up on a python-3000 discussion about making porting from 2.6 to 3.0 easier. Martin suggested making this its own thread.

This proposal is to add "from __future__ import unicode_string_literals", which would make all string literals in the importing module into unicode objects in 2.6. This is similar to the -U flag, but would only affect a single module at a time. I think history has shown that -U isn't really usable when using any number of modules, including many in the standard library.

There was another proposal from Christian Heimes to add "from __future__ import py3k_literals", which would:

1) '' creates a unicode object instead of a str object
2) b'' creates a str object (aka bytes in Python 3.0)
3) 1 creates a long instead of an int
4) 1L and u'' are invalid

2) is already taken care of in 2.6, since: type(b'') == str.

I don't think 3) is necessary. It's an implementation detail.

4) is really two issues. It's my understanding that there's a 2to3 fixer for both of these issues. But I'm open to debate on this.

I'm willing to implement this if there's consensus on it.

Eric.

Eric Smith

21 Mar

12:57 p.m.

Eric Smith wrote:

...

This proposal is to add "from __future__ import unicode_string_literals", which would make all string literals in the importing module into unicode objects in 2.6.

I'm going to withdraw this, for 2 reasons.

1) The more I think about it, the less sense it makes.
2) Without some extreme measures, it's not implementable.

It's not implementable because the work has to occur in ast.c (see Py_UnicodeFlag). It can't occur later, because you need to skip the encoding being done in parsestr(). But the __future__ import can only be interpreted after the AST is built, at which time the encoding has already been applied. There are some radical things you could do to work around this, but it would be a gigantic change.

As for it not making sense, this is really in the realm of 2to3. I'm beginning to really believe this statement in PEP 3000:

"There is no requirement that Python 2.6 code will run unmodified on Python 3.0. Not even a subset. (Of course there will be a tiny subset, but it will be missing major functionality.)"

For this particular issue, just use u'' in 2.6 and let 2to3 deal with it. If you have some 2.6 code that you want to run in 3.0 (by way of 2to3), I think all of your string literals should either be b'' or u''. Don't use plain ''.

Eric.

Christian Heimes

7:24 p.m.

Eric Smith schrieb:

...

It's not implementable because the work has to occur in ast.c (see Py_UnicodeFlag). It can't occur later, because you need to skip the encoding being done in parsestr(). But the __future__ import can only be interpreted after the AST is built, at which time the encoding has already been applied. There are some radical things you could do to work around this, but it would be a gigantic change.

So this basically comes down to "Either spend lots of time (and money) to rewrite the tokenizer and AST generator or keep the current behavior"? :/

...

For this particular issue, just use u'' in 2.6 and let 2to3 deal with it. If you have some 2.6 code that you want to run in 3.0 (by way of 2to3), I think all of your string literals should either be b'' or u''. Don't use plain ''.

For this particular issue one could probably easily come up with a fast fixer. A simple regexp should cover 99% of all occurrences of u'' and u"".

Christian
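A minimal sketch of the "simple regexp" idea, for illustration only. The pattern and the function name `strip_u_prefix` are assumptions made up for this example, not part of 2to3 or of any patch in this thread. It deliberately leaves the remaining cases alone: combined prefixes such as ur'' are untouched, and a u'' that appears inside another string or a comment would be rewritten by mistake.

```python
import re

# Hypothetical pattern: a lone u/U at a word boundary, directly before a quote.
U_PREFIX = re.compile(r"""\b[uU](?=['"])""")


def strip_u_prefix(source):
    """Drop the u prefix from string literals in source text (naive, text-based)."""
    return U_PREFIX.sub("", source)


converted = strip_u_prefix("x = u'abc' + U\"def\"\nmenu = ur'raw'\n")
print(converted)
```

Because the substitution works on raw text rather than on tokens, a tokenizer-based fixer is the safer choice for anything beyond a quick conversion.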

Eric Smith

8:06 p.m.

Christian Heimes wrote:

...

Eric Smith schrieb:

...
It's not implementable because the work has to occur in ast.c (see Py_UnicodeFlag). It can't occur later, because you need to skip the encoding being done in parsestr(). But the __future__ import can only be interpreted after the AST is built, at which time the encoding has already been applied. There are some radical things you could do to work around this, but it would be a gigantic change.

So this basically comes down to "Either spend lots of time (and money) to rewrite the tokenizer and AST generator or keep the current behavior"? :/

Pretty much. And even if it were possible, I don't see the point in doing it.

...

...
For this particular issue, just use u'' in 2.6 and let 2to3 deal with it. If you have some 2.6 code that you want to run in 3.0 (by way of 2to3), I think all of your string literals should either be b'' or u''. Don't use plain ''.

For this particular issue one could probably easily come up with a fast fixer. A simple regexp should cover 99% of all occurrences of u'' and u"".

2to3 already does this. My current thinking is that only b'' and u'' strings should be in 2.6 code that you want to move to 3.0. Maybe -3 should warn about regular string literals?
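A rough sketch of what such a warning could look for, written as a standalone script rather than as the -3 flag itself. The function name `plain_string_literals` is hypothetical, and the check is an assumption about what "regular string literal" should mean here: any string token whose text starts directly with a quote character. Raw strings such as r'' are not reported, and docstrings are.

```python
import io
import tokenize


def plain_string_literals(source):
    """Return (line number, literal text) for every string literal with no prefix."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        kind, text, start = tok[0], tok[1], tok[2]
        # A prefixed literal starts with a letter (u, b, r, ...), not a quote.
        if kind == tokenize.STRING and text[0] in "'\"":
            hits.append((start[0], text))
    return hits


example = "a = 'x'\nb = u'y'\nc = b'z'\nd = \"w\"\n"
for lineno, literal in plain_string_literals(example):
    print("line %d: unprefixed string literal %s" % (lineno, literal))
```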

Brett Cannon

10:54 p.m.

On Fri, Mar 21, 2008 at 11:06 AM, Eric Smith wrote:

...

Christian Heimes wrote:

...
Eric Smith schrieb:

...
It's not implementable because the work has to occur in ast.c (see Py_UnicodeFlag). It can't occur later, because you need to skip the encoding being done in parsestr(). But the __future__ import can only be interpreted after the AST is built, at which time the encoding has already been applied. There are some radical things you could do to work around this, but it would be a gigantic change.

So this basically comes down to "Either spend lots of time (and money) to rewrite the tokenizer and AST generator or keep the current behavior"? :/

Pretty much. And even if it were possible, I don't see the point in doing it.

...
...
For this particular issue, just use u'' in 2.6 and let 2to3 deal with it. If you have some 2.6 code that you want to run in 3.0 (by way of 2to3), I think all of your string literals should either be b'' or u''. Don't use plain ''.

For this particular issue one could probably easily come up with a fast fixer. A simple regexp should cover 99% of all occurrences of u'' and u"".

2to3 already does this.

My current thinking is that only b'' and u'' strings should be in 2.6 code that you want to move to 3.0. Maybe -3 should warn about regular string literals?

That's a possibility. It might also help to have a 3to2 fixer that goes through a module and adds the needed prefixes so one doesn't have to go through manually to tack them on.

-Brett

"Martin v. Löwis"

11:32 p.m.

...

It's not implementable because the work has to occur in ast.c (see Py_UnicodeFlag). It can't occur later, because you need to skip the encoding being done in parsestr(). But the __future__ import can only be interpreted after the AST is built, at which time the encoding has already been applied.

I think it would be possible to check for future statements on the basis of nodes already. Take a look at how Python 2.3 implemented future statements (why was that rewritten to use the AST, anyway?).

...

As for it not making sense, this is really in the realm of 2to3. I'm beginning to really believe this statement in PEP 3000:

There is still the original use case of people who don't want to run 2to3 (for whatever reasons - mostly probably subjective ones), and who would rather run a single code base unmodified. They don't care that documentation tells them this is impossible, when they feel they are so close to making it possible.

Regards,
Martin
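Martin's point about spotting future statements before the AST exists can be illustrated in pure Python. This is only a sketch of the idea at the token level, not how CPython's C parser does it; the function name `future_features` is hypothetical, and it ignores details the real compiler enforces (for example, it does not validate feature names).

```python
import io
import tokenize


def future_features(source):
    """List features named in leading `from __future__ import ...` statements."""
    features = []
    line = []          # (type, text) pairs of the current logical line
    first = True       # only the first statement may be a docstring
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        kind, text = tok[0], tok[1]
        if kind in (tokenize.COMMENT, tokenize.NL):
            continue
        if kind not in (tokenize.NEWLINE, tokenize.ENDMARKER):
            line.append((kind, text))
            continue
        if not line:
            continue
        words = [t for _, t in line]
        if words[:3] == ["from", "__future__", "import"]:
            skip_alias = False
            for k, t in line[3:]:
                if t == ";":
                    break              # another statement follows on this line
                if k != tokenize.NAME:
                    continue           # commas and parentheses
                if t == "as":
                    skip_alias = True  # the next name is only an alias
                elif skip_alias:
                    skip_alias = False
                else:
                    features.append(t)
        elif not (first and len(line) == 1 and line[0][0] == tokenize.STRING):
            break  # first ordinary statement: no future imports may follow
        first = False
        line = []
    return features


module_source = (
    '"""doc"""\n'
    "# comment\n"
    "from __future__ import unicode_literals, division as d\n"
    "x = 1\n"
    "from __future__ import print_function\n"
)
print(future_features(module_source))
```

The scan stops at the first ordinary statement, which mirrors the rule that future statements must appear near the top of a module.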

M.-A. Lemburg

11:35 p.m.

On 2008-03-21 22:32, Martin v. Löwis wrote:

...

...
It's not implementable because the work has to occur in ast.c (see Py_UnicodeFlag). It can't occur later, because you need to skip the encoding being done in parsestr(). But the __future__ import can only be interpreted after the AST is built, at which time the encoding has already been applied.

I think it would be possible to check for future statements on the basis of nodes already. Take a look at how Python 2.3 implemented future statements (why was that rewritten to use the AST, anyway?).

...
As for it not making sense, this is really in the realm of 2to3. I'm beginning to really believe this statement in PEP 3000:

There is still the original use case of people who don't want to run 2to3 (for whatever reasons - mostly probably subjective ones), and who would rather run a single code base unmodified. They don't care that documentation tells them this is impossible, when they feel they are so close to making it possible.

Could we point them to a special byte-code compiler such as Andrew Dalke's python4ply?

http://dalkescientific.com/Python/python4ply.html

That approach appears to be a lot easier to implement than trying to tweak the C implementation of the Python parser.

--
Marc-Andre Lemburg
eGenix.com

Christian Heimes

25 Mar

9:48 p.m.

Follow up: Neal and I have created a working patch: http://bugs.python.org/issue2477

We had to modify the parser API and add two functions. The two new functions are slightly modified versions of existing functions. We needed the flag argument to be an input/output variable (pointer) instead of an input-only variable. The rest of the code is straightforward.

I'd like to get a review from another developer before I commit the code.

Christian
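For readers who want to see the per-module effect, here is a small sketch. It assumes the feature was released under the shorter spelling `unicode_literals` (as I believe it was in 2.6) rather than the `unicode_string_literals` name used in this thread; the helper `literal_type_names` is made up for the example. It compiles the same source with and without the future flag and reports the type of each kind of literal.

```python
import __future__

# Assumption: released spelling of the feature, not the name proposed above.
FLAG = __future__.unicode_literals.compiler_flag

SOURCE = "plain = ''\nprefixed = u''\nraw = b''\n"


def literal_type_names(source, flags=0):
    """Compile source with the given future flags and name each literal's type."""
    namespace = {}
    # The final True is dont_inherit: ignore future flags of the calling module.
    code = compile(source, "<example>", "exec", flags, True)
    exec(code, namespace)
    return tuple(type(namespace[name]).__name__
                 for name in ("plain", "prefixed", "raw"))


default_types = literal_type_names(SOURCE)
future_types = literal_type_names(SOURCE, FLAG)
print("without flag:", default_types)
print("with flag:   ", future_types)
```

With the flag set, the plain literal should have the same type as the u'' literal, while b'' keeps the byte-string type. On Python 3 both runs report the same types, since unprefixed literals are already text there.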
