[I18n-sig] Pre-PEP: Proposed Python Character Model

Wed, 07 Feb 2001 12:35:51 -0800

"Martin v. Loewis" wrote:
> 
> ...
>
> Just try
> 
>   reader = codecs.lookup("ISO-8859-2")[2]
>   charfile = reader(file)
> 
> There could be a convenience function, but that also is a detail.

Usability is not a detail in this particular case. We are trying to
change people's behavior and help them write more robust code.
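
For concreteness, here is a minimal sketch of what such a convenience
function might look like. It assumes a current Python 3 interpreter, and
the name open_decoded is hypothetical, not an existing API. The call
codecs.getreader(encoding) returns the same stream reader factory as
codecs.lookup(encoding)[2] in the quoted snippet.

```python
import codecs
import io


def open_decoded(stream, encoding):
    """Hypothetical helper: wrap a byte stream so reads return decoded text."""
    # Same factory as codecs.lookup(encoding)[2], fetched by name.
    reader_factory = codecs.getreader(encoding)
    return reader_factory(stream)


# Byte 0xF5 is U+0151 (o with double acute) in ISO-8859-2.
raw = io.BytesIO(b"\xf5")
charfile = open_decoded(raw, "ISO-8859-2")
text = charfile.read()
```

The point is only that one call hides the tuple indexing; the right name
and home for such a function are open questions.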

>...
> My definition of "international standard" is "defined by an
> international organization", such as ISO. So ISO 8859 certainly
> qualifies. ISO 646 (aka ASCII) is also an international standard; it
> even allows for "national variants", but it does not allow
> mixed-language information. As for ISO 8859, it also supports Arabic
> and Hebrew, BTW.

That's fine. I'll change the document to be more explicit. Would you
agree with this: "Unicode is the only *character set* that supports *all
of the world's major written languages.*"?

> ...
> > I don't recall suggesting any such thing! chr() of a byte string should
> > return the byte value. chr() of a unicode string should return the
> > character value.
> 
> chr of a byte string? How exactly do I write this down? I.e. if I have
> chr(42), what do I get?

Sorry, I meant ord. ord of a byte string (or byte array) should return
the byte value. ord of a character string should return the character
value.
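
As a rough illustration of that distinction, assuming a Python 3
interpreter where byte strings and character strings are separate types
(which is not the case in the Python this thread is discussing):

```python
# A one-byte byte string and a one-character string.
byte_string = b"\xf5"
char_string = "\u0151"

# ord of a byte string gives the byte value (0..255).
byte_value = ord(byte_string)

# ord of a character string gives the character value, which may be large.
char_value = ord(char_string)
```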

> > Not under my proposal. file.read returns a character string. Sometimes
> > the character string contains characters between 0 and 255 and is
> > indistinguishable from today's string type. Sometimes the file object
> > knows that you want the data decoded and it returns large characters.
> 
> I guess we have to defer this until I see whether it is feasible
> (which I believe it is not - it was the mistake Sun made in the early
> JDKs).

What was the mistake?

> > I can imagine all kinds of pickle-like or structured stream file
> > formats that switch back and forth between binary information,
> > strings and unicode.
> 
> For example? If a format supports mixing binary and text information,
> it needs to specify what encoding to use for the text fragments, and
> it needs to specify how exactly conversion is performed (in case of
> stateful codecs). It is certainly the application's job to get this
> right; only the application knows how the format is supposed to work.

You and I agree that streams can change encoding mid-stream. You
probably think that should be handled by passing the stream to various
codecs as you read (or by doing double-buffer reads). I think that it
should be possible right in the read method. But I don't care enough to
argue about it.
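
To make "right in the read method" concrete, here is a small sketch. The
MixedReader class and its encoding parameter are hypothetical and exist
only for illustration; the sketch assumes Python 3 and stateless codecs,
and it assumes the caller knows how many bytes each text fragment
occupies.

```python
import io


class MixedReader:
    """Hypothetical wrapper: read(n) gives bytes, read(n, encoding) gives text."""

    def __init__(self, stream):
        self._stream = stream

    def read(self, nbytes, encoding=None):
        data = self._stream.read(nbytes)
        # Decoding per call is only safe when nbytes covers whole characters
        # and the codec carries no state between calls.
        return data if encoding is None else data.decode(encoding)


# Two binary bytes, then the same character in ISO-8859-2 and in UTF-8.
payload = b"\x00\x2a" + "\u0151".encode("iso-8859-2") + "\u0151".encode("utf-8")
reader = MixedReader(io.BytesIO(payload))

header = reader.read(2)                  # raw bytes
latin2 = reader.read(1, "iso-8859-2")    # one byte, one character
utf8 = reader.read(2, "utf-8")           # two bytes, one character
```

Stateful codecs would need more machinery than this, which is the concern
raised in the quoted text.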

> > > The specific syntax may be debatable; I dislike semantics being put in
> > > comments. There should be first-class syntax for that. Agree on the
> > > principle approach.
> >
> > We need a backwards-compatible syntax...
> 
> Why is that? The backwards-compatible way of writing funny bytes is to
> use \x escapes.

Maybe we don't need a backwards-compatible syntax after all. I haven't
thought through all of those issues.

> > This is a fundamental disagreement that we will have to work through.
> > What is "questionable" about interpreting a unicode 245 as a character
> > 245? If you wanted UTF-8 you would have asked for UTF-8!!!
> 
> Likewise, if you want Latin-1 you should ask for it. Explicit is
> better than implicit.

It's funny how we switch back and forth. If I say that Python reads byte
245 into character 245 and thus uses Latin-1 as its default encoding, I'm
told I'm wrong: Python has no native encoding. If I claim that in
passing data to C we should treat character 245 as the C "char" with the
value 245, you tell me that I'm proposing Latin-1 as the default
encoding.

Python has a concept of character that extends from 0 to 255. C has a
concept of character that extends from 0 to 255. There is no issue of
"encoding" as long as you stay within those ranges. This is *exactly*
like the int/long int situation.

Once you get out of these ranges you switch the type in C to wchar_t and
you are off to the races. If you can't change the C code then that means
you work around it from the Python side -- you UTF-8 encode it before
passing it to the C code.
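
A minimal sketch of that rule from the Python side, assuming Python 3.
The helper name to_c_bytes is hypothetical. Mapping character N to byte N
happens to be spelled "latin-1" in today's codec names, which is the very
naming question under dispute here.

```python
def to_c_bytes(text):
    """Hypothetical helper: prepare text for C code that takes narrow chars."""
    if all(ord(ch) <= 255 for ch in text):
        # Every character fits in a C char: character N becomes byte N.
        return text.encode("latin-1")
    # Out of range: the workaround is to UTF-8 encode before the call.
    return text.encode("utf-8")


narrow = to_c_bytes(chr(245))   # stays a single byte with value 245
wide = to_c_bytes("\u0151")     # does not fit in one byte, so UTF-8
```

Note that the receiving C code cannot tell the two cases apart from the
bytes alone, so the caller has to know which one applies.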

> ...
> That Chinese Python programmer should use his editor of choice, and
> put _() around strings that are meant as text (as opposed to strings
> that are protocol). 

I don't know what you mean by "protocol" here. Nevertheless, you are
saying that the Chinese programmer must do more than the English
programmer does, and I consider that a problem.

> Yes, using type("") is a problem. I'd like to see a symbolic name
> 
> StringTypes = [StringType, UnicodeType]
> 
> in the types module.

That doesn't help to reform the mass of code out there.

 Paul Prescod