[I18n-sig] Pre-PEP: Proposed Python Character Model
Martin v. Loewis
Thu, 8 Feb 2001 02:08:56 +0100
> > Just try
> > reader = codecs.lookup("ISO-8859-2")
> > charfile = reader(file)
> > There could be a convenience function, but that also is a detail.
> Usability is not a detail in this particular case. We are trying to
> change people's behavior and help them make more robust code.
Ok, just propose a specific patch; I'd recommend adding another
function to the codecs module rather than adding another built-in.
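As a sketch of what such a convenience function might look like (the name open_text is hypothetical, and this uses the codecs module's lookup/StreamReader machinery via the modern CodecInfo attribute API):

```python
import codecs

def open_text(path, encoding):
    # Hypothetical helper: wrap a byte stream in a StreamReader for
    # the given encoding, so read() returns decoded characters
    # rather than raw bytes.
    info = codecs.lookup(encoding)
    return info.streamreader(open(path, "rb"))

# reader = open_text("data.txt", "iso-8859-2")
# text = reader.read()
```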
> That's fine. I'll change the document to be more explicit. Would you
> agree that: "Unicode is the only *character set* that supports *all of
> the world's major written languages.*"
That is certainly the case.
> > > Not under my proposal. file.read returns a character string. Sometimes
> > > the character string contains characters between 0 and 255 and is
> > > indistinguishable from today's string type. Sometimes the file object
> > > knows that you want the data decoded and it returns large characters.
> > I guess we have to defer this until I see whether it is feasible
> > (which I believe it is not - it was the mistake Sun made in the early
> > JDKs).
> What was the mistake?
Early Java had methods that treated Strings and byte arrays
interchangeably if the strings had character values below 256. One
left-over from that is
public String(byte[] ascii, int hibyte); // in class java.lang.String
It would take each byte of the ascii array as the low byte of a
character and use hibyte as the high byte; hibyte was typically 0.
The documentation now says
# Deprecated. This method does not properly convert bytes into
# characters. As of JDK 1.1, the preferred way to do this is via the
# String constructors that take a character-encoding name or that use
# the platform's default encoding.
The reverse operation of that is getBytes(int srcBegin, int srcEnd,
byte[] dst, int dstBegin):
# Deprecated. This method does not properly convert characters into
# bytes. As of JDK 1.1, the preferred way to do this is via the
# getBytes(String enc) method, which takes a character-encoding name,
# or the getBytes() method, which uses the platform's default
# encoding.
I'd say your proposal is in the direction of repeating this mistake.
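The same pitfall can be shown in modern Python terms (bytes/str API used purely for illustration): a lone byte has no character meaning until an encoding names it.

```python
# One byte, two different characters depending on the encoding:
data = b"\xf5"
assert data.decode("latin-1") == "\u00f5"      # õ in ISO-8859-1
assert data.decode("iso-8859-2") == "\u0151"   # ő in ISO-8859-2
# Converting bytes to characters without naming an encoding
# silently assumes Latin-1 -- the mistake deprecated in the JDK.
```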
> You and I agree that streams can change encoding mid-stream. You
> probably think that should be handled by passing the stream to various
> codecs as you read (or by doing double-buffer reads). I think that it
> should be possible right in the read method.
Please take it as a fact that it is impossible to do that at an
arbitrary point in the stream; codecs that need to maintain state
will break if they are switched mid-stream.
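A sketch of why, using Python's incremental decoder interface (a later addition to the codecs module, used here only to illustrate the state problem):

```python
import codecs

# 'é' is two bytes in UTF-8 (0xC3 0xA9). Split them across two reads.
dec = codecs.getincrementaldecoder("utf-8")()
first = dec.decode(b"\xc3")   # no character yet: decoder holds state
second = dec.decode(b"\xa9")  # pending byte completed

assert first == ""
assert second == "\xe9"
# Switching the stream to a different codec between the two reads
# would orphan the pending byte, so an arbitrary-point switch fails.
```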
> It's funny how we switch back and forth. If I say that Python reads byte
> 245 into character 245 and thus uses Latin 1 as its default encoding I'm
> told I'm wrong. Python has no native encoding. If I claim that in
> passing data to C we should treat character 245 as the C "char" with the
> value 245 you tell me that I'm proposing Latin 1 as the default
> encoding.
Python has no default character set *in its byte string type*. Once
you have Unicode objects, talking about language-specified character
sets is meaningful.
> Python has a concept of character that extends from 0 to 255. C has a
> concept of character that extends from 0 to 255. There is no issue of
> "encoding" as long as you stay within those ranges.
C supports various character sets, depending on context. Encodings do
matter here already, e.g. when selecting fonts. Some character sets
supported in C have characters above 255, even if they are stored in
char* (in particular, MBCS encodings have this property).
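Both points can be illustrated in Python (modern bytes/str API, for illustration only): Latin-1 is the one encoding where byte value and character value coincide, while an MBCS stores one character with a large value in several char slots.

```python
# Latin-1: byte value == code point for all 256 values.
data = bytes(range(256))
assert all(ord(ch) == i for i, ch in enumerate(data.decode("latin-1")))

# Shift-JIS (an MBCS): one character, value far above 255,
# stored as two C chars.
ch = "\u65e5"                     # the character 日
assert len(ch.encode("shift_jis")) == 2
assert ord(ch) == 0x65E5
```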
> > That Chinese Python programmer should use his editor of choice, and
> > put _() around strings that are meant as text (as opposed to strings
> > that are protocol).
> I don't know what you mean by "protocol" here.
If you do
print "GET "+url+" HTTP/1.0"
then the strings are really not meant to be human-readable, they are
part of some machine-to-machine communication protocol.
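In sketch form (modern Python notation; the _() marker comes from gettext):

```python
url = "/index.html"

# Protocol string: machine-to-machine, ASCII by specification.
# Not human-readable text, so never marked for translation.
request = ("GET " + url + " HTTP/1.0\r\n\r\n").encode("ascii")
assert request == b"GET /index.html HTTP/1.0\r\n\r\n"

# Human-readable text, by contrast, would be marked:
#   print _("Could not fetch the page")
```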
> But nevertheless, you are saying that the Chinese programmer must do
> more than the English programmer does and I consider that a problem.
It just works for the English programmer by coincidence; that
programmer should really distinguish text from byte strings in
source as well.
Following the Unicode path, source files should be UTF-8, but that
won't work in practice because of missing editor support.