Proposal: require 7-bit source str's
Hallvard B Furuseth
h.b.furuseth at usit.uio.no
Fri Aug 6 06:52:37 EDT 2004
Peter Otten wrote:
>Hallvard B Furuseth wrote:
>>Peter Otten wrote:
>>>Hallvard B Furuseth wrote:
>>>
>>> Why would you reintroduce ambiguity with your s-prefixed
>>> strings?
>>
>> For programs that work with non-Unicode output devices or files and
>> know which character set they use. Which is quite a lot of programs.
>
> I'd say a lot of programs work with non-unicode, but many don't know what
> they are doing - i. e. you cannot move them into an environment with a
> different encoding (if you do they won't notice).
True, but for them it would probably be simpler to not use the str7bit
declaration, or to explicitly declare str7bit:False for the entire file.
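For what it's worth, the kind of check the proposal asks for can already
be approximated with an external tool today. Here is a rough sketch
(modern Python, using the stdlib tokenize module; the function name and
report format are my own invention, not part of any proposal) that flags
string literals containing non-ASCII characters:

```python
import io
import tokenize

def find_non_ascii_strings(source):
    """Return (line number, literal) for each string literal
    in `source` that contains a character above 127."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and any(ord(c) > 127 for c in tok.string):
            hits.append((tok.start[0], tok.string))
    return hits

src = 's = "blåbærsyltetøy"\nt = "plain ascii"\n'
# Only the first literal is reported; the pure-ASCII one passes.
print(find_non_ascii_strings(src))
```

A per-file str7bit:False declaration would then simply tell such a
checker to skip the file.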
>>> The long-term goal would be unicode throughout, IMHO.
>>
>> Whose long-term goal for what? For things like Internet communication,
>> fine. But there are lot of less 'global' applications where other
>> character encodings make more sense.
>
> Here we disagree. Showing the right image for a character should be
> the job of the OS and should safely work cross-platform.
Yes. What of it?
Programs that show text still need to know which character set the
source text has, so it can pass the OS the text it expects, or send a
charset directive to the OS, or whatever.
> Why shouldn't I be able to store a file with a greek or chinese name?
If you want an OS that allows that, get an OS which allows that.
> I wasn't able to quote Martin's
> surname correctly for the Python-URL. That's a mess that should be cleaned
> up once per OS rather than once per user. I don't see how that can happen
> without unicode (only). Even NASA blunders when they have to deal with
> meters and inches.
Yes, there are many non-'global' applications too where Unicode is
desirable. What of it?
Just because you want Unicode, why shouldn't I be allowed to use
other character encodings in cases where they are more practical?
For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.
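To make the ns_4551-1 point concrete, here is a small sketch (the
translation table and function are mine, just illustrating the encoding's
layout): the charset reuses the code points of [\]{|} for ÆØÅæøå, which
follow Z/z in the Norwegian alphabet, so plain byte comparison yields
Norwegian order, while comparing Unicode code points does not:

```python
# NS 4551-1 puts æ, ø, å at the byte values of {, |, } (0x7B-0x7D),
# i.e. directly after 'z' (0x7A) -- matching Norwegian alphabet order.
NS4551 = {'Æ': 0x5B, 'Ø': 0x5C, 'Å': 0x5D,
          'æ': 0x7B, 'ø': 0x7C, 'å': 0x7D}

def to_ns4551(s):
    """Encode a string to NS 4551-1 bytes (ASCII plus the six letters)."""
    return bytes(NS4551.get(c, ord(c)) for c in s)

words = ['år', 'ærlig', 'øl', 'zebra']
# Correct Norwegian order: zebra, ærlig, øl, år.
print(sorted(words, key=to_ns4551))  # byte order: correct
print(sorted(words))                 # Unicode code points: zebra first,
                                     # then å, æ, ø -- wrong order
```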
>> In any case, both a language's short-term and long-term goals should
>> be to support current programming, not programming as it 'should be
>> done' some day in the future.
>
> Well, Python's integers already work like they 'should be done'.
And they can be used that way now.
> I'm no
> expert, but I think Java is closer to the 'real thing' concerning strings.
I don't know Java.
> Perl 6 is going for unicode, if only to overcome the limitations of their
> operator set (they want the yen symbol as a zipping operator because it
> looks like a zipper :-).
I don't know Perl 6, but Perl 5 is an excellent example of how not to do
this. So is Emacs' MULE, for that matter.
I recently had to downgrade to perl5.004 when perl5.8 broke my programs.
They worked fine until they were moved to a machine where someone had
set up the locale to use UTF-8. Then Perl decided that my data, which
has nothing at all to do with the locale, was Unicode data. I tried to
insert 'use bytes', but that didn't work. It does seem to work in newer
Perl versions, but it's not clear to me how many places I have to insert
some magic to prevent that. Nor am I interested in finding out: I just
don't trust the people who released such a piece of crap to leave my
non-Unicode strings alone. In particular since _most_ of the strings
are UTF-8, so I wonder if Perl might decide to do something 'friendly'
with them.
> You have to make compromises and I think an external checker would be
> the way to go in your case. If I were to add a switch to Python's
> string handling it would be "all-unicode".
Meaning what?
> But it may well be that I would curse it after the first real-world
> use...
--
Hallvard