Proposal: require 7-bit source str's

Hallvard B Furuseth h.b.furuseth at usit.uio.no
Thu Aug 5 23:38:20 CEST 2004


John Roth wrote:
>"Hallvard B Furuseth" <h.b.furuseth at usit.uio.no> wrote in message
>news:HBF.20040805p736 at bombur.uio.no...
>> Now that the '-*- coding: <charset> -*-' feature has arrived,
>> I'd like to see an addition:
>>
>>   # -*- str7bit:True -*-
>>
>>   After the source file has been converted to Unicode, cause a parse
>>   error if a non-u'' string contains a non-7bit source character.
>>
>> It can be used to ensure that the source file doesn't contain national
>> characters that the program will treat as characters in the current
>> locale's character set instead of in the source file's character set.
>> (...)
> 
> Is this even an issue? If you specify utf-8 as the character
> set, I can't see how non-unicode strings could have
> anything other than 7-bit ascii, for the simple reason that
> the interpreter wouldn't know which encoding to use.

Sorry, I should have included an example.

  # -*- coding:iso-8859-1; str7bit:True; -*-

  A = u'hør'  # ok
  B =  'hør'  # error because of str7bit.
  print B

The 'coding' directive ensures this source code is translated correctly
to Unicode.  However, string B is then translated back to the source
character set so it can be stored as a str object and not a unicode
object.

The print statement just outputs the bytes in B; it doesn't do any
character set handling.  So if your terminal uses latin-2, it will
display the 'ø' (byte 0xF8) as Latin small letter r with caron.
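
To make the byte-level effect concrete, here is a rough sketch in
present-day Python 3, where a bytes object stands in for the 8-bit str
discussed above.  The variable names are only illustrative, and the
decode step merely imitates what a latin-2 terminal would do with the
bytes it receives.

```python
# Python 3 sketch; bytes play the role of the old 8-bit str.
text = 'h\xf8r'                        # 'hør' after the source is decoded to Unicode
stored = text.encode('iso-8859-1')     # what a non-u'' string would hold (source charset)
shown = stored.decode('iso-8859-2')    # how a latin-2 terminal would interpret those bytes

print(repr(stored))                    # the raw bytes, with 0xF8 for 'ø'
print(ascii(shown))                    # the middle character is now U+0159
```

The bytes never change; only the interpretation of byte 0xF8 differs
between the two character sets.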

coding:utf-8 wouldn't help either.  B would remain a plain string, not a
Unicode string, and print would output its raw utf-8 bytes.
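
Since the coding directive alone doesn't catch this, the check has to
look at the string literals themselves.  The following is only a rough
approximation of the proposed str7bit check, done outside the
interpreter with the standard tokenize module and run under Python 3.
The function name non_7bit_plain_strings is made up for this sketch, the
source text is assumed to be already decoded to Unicode, and
Python-2-only prefixes such as ur'' are not handled.

```python
import io
import tokenize

def non_7bit_plain_strings(source):
    """Yield (line, literal) for string literals that have no u prefix
    but contain a non-7-bit character.  Hypothetical helper, not part
    of any existing tool."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        literal = tok.string
        # The prefix is everything before the first quote character.
        quote_pos = min(i for i in (literal.find("'"), literal.find('"')) if i != -1)
        prefix = literal[:quote_pos].lower()
        if 'u' in prefix:
            continue                      # u'' strings are allowed
        if any(ord(ch) > 127 for ch in literal):
            yield tok.start[0], literal   # 1-based line number, offending literal

# The example from this message, minus the directive line.
example = (
    "A = u'h\xf8r'\n"
    "B =  'h\xf8r'\n"
    "print B\n"
)
offenders = list(non_7bit_plain_strings(example))
print(len(offenders), "offending literal(s), first on line", offenders[0][0])
```

Only the tokenizer is used, so the Python 2 print statement in the
example does not matter to the check.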

-- 
Hallvard


