Proposal: require 7-bit source str's
Hallvard B Furuseth
h.b.furuseth at usit.uio.no
Fri Aug 6 06:52:37 EDT 2004
Peter Otten wrote:
>Hallvard B Furuseth wrote:
>>Peter Otten wrote:
>>>Hallvard B Furuseth wrote:
>>>
>>> Why would you reintroduce ambiguity with your s-prefixed
>>> strings?
>>
>> For programs that work with non-Unicode output devices or files and
>> know which character set they use. Which is quite a lot of programs.
>
> I'd say a lot of programs work with non-unicode, but many don't know what
> they are doing - i. e. you cannot move them into an environment with a
> different encoding (if you do they won't notice).
True, but for them it would probably be simpler to not use the str7bit
declaration, or to explicitly declare str7bit:False for the entire file.
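For what it's worth, the kind of check the proposal asks for can already
be approximated with an external tool today. Here is a rough sketch
(modern Python, using the stdlib tokenize module; the function name and
report format are my own invention, not part of any proposal) that flags
string literals containing non-ASCII characters:

```python
import io
import tokenize

def find_non_ascii_strings(source):
    """Return (line number, literal) for each string literal
    in `source` that contains a character above 127."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and any(ord(c) > 127 for c in tok.string):
            hits.append((tok.start[0], tok.string))
    return hits

src = 's = "blåbærsyltetøy"\nt = "plain ascii"\n'
# Only the first literal is reported; the pure-ASCII one passes.
print(find_non_ascii_strings(src))
```

A per-file str7bit:False declaration would then simply tell such a
checker to skip the file.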
>>> The long-term goal would be unicode throughout, IMHO.
>>
>> Whose long-term goal for what? For things like Internet communication,
>> fine. But there are lot of less 'global' applications where other
>> character encodings make more sense.
>
> Here we disagree. Showing the right image for a character should be
> the job of the OS and should safely work cross-platform.
Yes. What of it?
Programs that show text still need to know which character set the
source text has, so it can pass the OS the text it expects, or send a
charset directive to the OS, or whatever.
> Why shouldn't I be able to store a file with a greek or chinese name?
If you want an OS that allows that, get an OS which allows that.
> I wasn't able to quote Martin's
> surname correctly for the Python-URL. That's a mess that should be cleaned
> up once per OS rather than once per user. I don't see how that can happen
> without unicode (only). Even NASA blunders when they have to deal with
> meters and inches.
Yes, there are many non-'global' applications too where Unicode is
desirable. What of it?
Just because you want Unicode, why shouldn't I be allowed to use
other character encodings in cases where they are more practical?
For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.
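To make the ns_4551-1 point concrete, here is a small sketch (the
translation table and function are mine, just illustrating the encoding's
layout): the charset reuses the code points of [\]{|} for ÆØÅæøå, which
follow Z/z in the Norwegian alphabet, so plain byte comparison yields
Norwegian order, while comparing Unicode code points does not:

```python
# NS 4551-1 puts æ, ø, å at the byte values of {, |, } (0x7B-0x7D),
# i.e. directly after 'z' (0x7A) -- matching Norwegian alphabet order.
NS4551 = {'Æ': 0x5B, 'Ø': 0x5C, 'Å': 0x5D,
          'æ': 0x7B, 'ø': 0x7C, 'å': 0x7D}

def to_ns4551(s):
    """Encode a string to NS 4551-1 bytes (ASCII plus the six letters)."""
    return bytes(NS4551.get(c, ord(c)) for c in s)

words = ['år', 'ærlig', 'øl', 'zebra']
# Correct Norwegian order: zebra, ærlig, øl, år.
print(sorted(words, key=to_ns4551))  # byte order: correct
print(sorted(words))                 # Unicode code points: zebra first,
                                     # then å, æ, ø -- wrong order
```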
>> In any case, both a language's short-term and long-term goals should
>> be to support current programming, not programming as it 'should be
>> done' some day in the future.
>
> Well, Python's integers already work like they 'should be done'.
And they can be used that way now.
> I'm no
> expert, but I think Java is closer to the 'real thing' concerning strings.
I don't know Java.
> Perl 6 is going for unicode, if only to overcome the limitations of their
> operator set (they want the yen symbol as a zipping operator because it
> looks like a zipper :-).
I don't know Perl 6, but Perl 5 is an excellent example of how not to do
this. So is Emacs' MULE, for that matter.
I recently had to downgrade to perl5.004 when perl5.8 broke my programs.
They worked fine until they were moved to a machine where someone had
set up the locale to use UTF-8. Then Perl decided that my data, which
has nothing at all to do with the locale, was Unicode data. I tried to
insert 'use bytes', but that didn't work. It does seem to work in newer
Perl versions, but it's not clear to me how many places I have to insert
some magic to prevent that. Nor am I interested in finding out: I just
don't trust the people who released such a piece of crap to leave my
non-Unicode strings alone. In particular since _most_ of the strings
are UTF-8, so I wonder if Perl might decide to do something 'friendly'
with them.
> You have to make compromises and I think an external checker would be
> the way to go in your case. If I were to add a switch to Python's
> string handling it would be "all-unicode".
Meaning what?
> But it may well be that I would curse it after the first real-world
> use...
--
Hallvard