[Python-3000] setup.py fails in the py3k-struni branch

Ron Adam rrr at ronadam.com
Fri Jun 15 06:51:10 CEST 2007



Guido van Rossum wrote:
> On 6/13/07, Ron Adam <rrr at ronadam.com> wrote:
>> Well I can see where a str8() type with an __incoded_with__ attribute
>> could be useful.  It would use a bit more memory, but it won't be the
>> default/primary string type anymore so maybe it's ok.
>>
>> Then bytes can be bytes, and unicode can be unicode, and str8 can be
>> encoded strings for interfacing with the outside non-unicode world.  Or
>> something like that. <shrug>
> 
> Hm... Requiring each str8 instance to have an encoding might be a
> problem -- it means you can't just create one from a bytes object.
> What would be the use of this information? What would happen on
> concatenation? On slicing? (Slicing can break the encoding!)

Round trips to and from bytes should work just fine.  Why would that be a 
problem?

There really is no safety in concatenating and slicing encoded 8-bit 
strings now.  If two strings with different encodings are combined by 
accident, then all bets are off.  And since there is no way to ask a string 
what its current encoding is, this becomes an easy-to-make, hard-to-find 
silent error.  So we have to be very careful not to mix encoded strings 
with different encodings.
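
To make the failure concrete, here is a small sketch.  It assumes the bytes 
and str types as they ended up in Python 3 rather than str8, and the sample 
text is made up for illustration.

```python
text = "caf\u00e9"

# The same text held as bytes in two different encodings.
latin1_bytes = text.encode("latin-1")   # 4 bytes
utf8_bytes = text.encode("utf-8")       # 5 bytes

# Concatenation succeeds silently; nothing records that the halves differ.
mixed = latin1_bytes + b" " + utf8_bytes

# Decoding as latin-1 raises no error but mangles the second word.
as_latin1 = mixed.decode("latin-1")

# Decoding as UTF-8 fails on the latin-1 half.
try:
    mixed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Slicing has the same problem: cutting inside a multi-byte sequence
# leaves bytes that are no longer valid UTF-8.
try:
    utf8_bytes[:4].decode("utf-8")
    slice_ok = True
except UnicodeDecodeError:
    slice_ok = False
```

Neither the concatenation nor the slice complains at the point where the 
damage is done, which is what makes these errors hard to track down.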

It's not too different from trying to find the current unicode and str8 
issues in the py3k-struni branch.

Concatenating str8 and str types is a bit safer, as long as the str8 is in 
"the" default encoding, but it may still be an unintended implicit 
conversion.  And if it's not in the default encoding, then all bets are off 
again.

The use would be in ensuring the integrity of encoded strings. 
Concatenating strings with different encodings could then produce errors, 
and explicit casting could automatically decode and encode as needed, which 
would eliminate a lot of encode/decode confusion.
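
A rough sketch of what I have in mind, written against the Python 3 bytes 
type.  The class name EncodedBytes and its methods (from_text, to_text, 
recode) are hypothetical names made up for this example, not an existing 
API.

```python
class EncodedBytes(bytes):
    """Hypothetical bytes subclass that remembers its encoding."""

    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding  # stored per instance
        return self

    @classmethod
    def from_text(cls, text, encoding):
        return cls(text.encode(encoding), encoding)

    def to_text(self):
        return bytes(self).decode(self.encoding)

    def recode(self, encoding):
        """Explicit cast: decode with the old codec, encode with the new."""
        return EncodedBytes.from_text(self.to_text(), encoding)

    def __add__(self, other):
        # Refuse anything that is not tagged with the same encoding.
        if not isinstance(other, EncodedBytes):
            raise TypeError("can only concatenate EncodedBytes")
        if other.encoding != self.encoding:
            raise TypeError("encoding mismatch: %s vs %s"
                            % (self.encoding, other.encoding))
        return EncodedBytes(bytes(self) + bytes(other), self.encoding)


a = EncodedBytes.from_text("caf\u00e9", "latin-1")
b = EncodedBytes.from_text("caf\u00e9", "utf-8")
joined = a.recode("utf-8") + b   # fine after an explicit cast
# a + b would raise TypeError instead of silently mixing encodings
```

Slicing is not handled here; a slice still comes back as plain bytes, so 
that part of the question stays open.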

This morning I was thinking all of this could be done as a module that 
possibly uses metaclasses or mixins to create encoded string types.  Then 
it wouldn't need an attribute on the instances.  Possibly someone has 
already done something along those lines?
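
Something like the following, maybe.  This is only a sketch under my own 
assumptions: encoded_type and _EncodedMixin are hypothetical names, and it 
uses a mixin plus type() rather than a full metaclass.

```python
import codecs

_types = {}  # cache: canonical codec name -> generated class


class _EncodedMixin:
    """Shared behavior; the encoding lives on the class, not the instance."""
    encoding = None

    def to_text(self):
        return bytes(self).decode(self.encoding)

    def __add__(self, other):
        # Only instances of exactly the same generated type may be joined.
        if type(other) is not type(self):
            raise TypeError("cannot concatenate %s and %s"
                            % (type(self).__name__, type(other).__name__))
        return type(self)(bytes(self) + bytes(other))


def encoded_type(encoding):
    """Return (creating it if needed) the bytes subclass for one encoding."""
    name = codecs.lookup(encoding).name  # normalizes aliases
    if name not in _types:
        cls_name = "Bytes_" + name.replace("-", "_")
        _types[name] = type(cls_name, (_EncodedMixin, bytes),
                            {"encoding": name})
    return _types[name]


Utf8 = encoded_type("utf-8")
Latin1 = encoded_type("latin-1")
joined = Utf8(b"ab") + Utf8(b"cd")
# Utf8(b"a") + Latin1(b"b") would raise TypeError
```

The encoding check then becomes a plain type check, and aliases of the same 
codec map to the same class.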

But back to the issues at hand...

>> Attached both the str8 repr as s"..." and s'...', and the latest
>> no_raw_escape patch which I think is complete now and should apply
>> with no problems.
> 
> I like the str8 repr patch enough to check it in.
> 
>> I tracked the random fails I am having in test_tokenize.py down to it
>> doing a round trip on random test_*.py files.  If one of those files
>> has a problem it causes test_tokenize.py to fail also.  So I added a
>> line to the test to output the file name it does the round trip on so
>> those can be fixed as they are found.
>>
>> Let me know if it needs to be adjusted or something doesn't look right.
> 
> Well, I'm still philosophically uneasy with r'\' being a valid string
> literal, for various reasons (one being that writing a string parser
> becomes harder and harder).

Hmmm..  It looks to me like the thing that makes it somewhat hard is 
determining whether it's a single-quote, empty-single-quote, or 
triple-quote string.  I made some improvements to that in tokenize.c, 
although it may not be clear from just looking at the unified diff.

After that, it was just a matter of checking a !is_raw_str flag before 
always blindly accepting the following character.
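
In Python terms the idea is roughly the following.  This is not the 
tokenize.c code, just a sketch for single-line strings; the function name 
find_closing_quote and its arguments are made up for the example.

```python
def find_closing_quote(source, start, quote, is_raw):
    """Return the index of the closing quote, or -1 if unterminated.

    `start` is the index just past the opening quote.
    """
    i = start
    while i < len(source):
        ch = source[i]
        if ch == "\\" and not is_raw:
            # Non-raw string: accept the character after the backslash
            # without looking at it, so an escaped quote does not close.
            i += 2
            continue
        if ch == quote:
            return i
        i += 1
    return -1


src = "r'\\' + x"   # the source text  r'\' + x
raw_end = find_closing_quote(src, 2, "'", is_raw=True)
plain_end = find_closing_quote(src, 2, "'", is_raw=False)
```

With the flag set, the quote right after the backslash closes the string; 
without it, the backslash swallows the quote and the string runs off the 
end.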

Before those changes it was a matter of doing that and also checking the 
quote type status, which wasn't intuitive since the string parsing loop was 
entered before the opening quote type was confirmed.

I can remove the raw string flag and flag check and leave the other changes 
in, or revert the whole file.  Any preference?  The latter makes it an 
easy, roughly three-line change to add r'\' support back in.

I'll have to look at tokenize.py again to see what needs to be done there. 
It uses regular expressions to parse the file.

> I definitely want r'\u1234' to be a
> 6-character string, however. Do you have a patch that does just that?
> (We can argue over the rest later in a larger forum.)

I can split the patch into two patches.  The second one, which allows a 
backslash at the end of raw strings, can be reviewed later.
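
For reference, a quick check of what the first patch alone should give.  
This assumes an interpreter that behaves like released Python 3, where raw 
strings no longer process \u escapes and a lone trailing backslash is still 
rejected; the helper name compiles is just for this example.

```python
s = r'\u1234'
length = len(s)   # the characters are \ u 1 2 3 4


def compiles(src):
    """Return True if `src` is accepted as an expression."""
    try:
        compile(src, "<example>", "eval")
        return True
    except SyntaxError:
        return False


trailing_ok = compiles(r"r'\'")         # source text: r'\'
bytes_trailing_ok = compiles(r"br'\'")  # source text: br'\'
```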

What about br'\'?  Should that be excluded also?

Ron










