Re: [Python-Dev] #pragmas in Python source code

M.-A. Lemburg mal@lemburg.com
Fri, 14 Apr 2000 23:22:08 +0200


Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > > but they won't -- if you don't use an encoding directive, and
> > > don't use 8-bit characters in your string literals, everything
> > > works as before.
> > >
> > > (that's why the default is "none" and not "utf-8")
> > >
> > > if you use 8-bit characters in your source code and wish to
> > > add an encoding directive, you need to add the right encoding
> > > directive...
> >
> > Fair enough, but this would render all the auto-coercion
> > code currently in 1.6 useless -- all string to Unicode
> > conversions would have to raise an exception.
> 
> I thought it was rather clear by now that I think the auto-
> conversion stuff *is* useless...
>
> but no, that doesn't mean that all string to unicode conversions
> need to raise exceptions -- any 8-bit unicode character obviously
> fits into a 16-bit unicode character, just like any integer fits in a
> long integer.
> 
> if you convert the other way, you might get an OverflowError, just
> like converting from a long integer to an integer may give you an
> exception if the long integer is too large to be represented as an
> ordinary integer.  after all,
> 
>     i = int(long(v))
> 
> doesn't always raise an exception...

This is exactly the same as proposing to change the default
encoding to Latin-1.

I don't have anything against that (being a native Latin-1
user :), but I would assume that writers of other native
languages surely do: e.g. all programmers not using Latin-1
as their native encoding (and there are lots of them).
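
To make the equivalence concrete, here is a minimal sketch in present-day
Python 3 (not the 1.6 API under discussion; the names raw, widened, decoded
and narrow are made up for illustration). It shows that widening each 8-bit
value to the Unicode character with the same ordinal gives the same result
as decoding the bytes as Latin-1, and that the reverse direction fails for
characters above 255. Python 3 reports that failure as UnicodeEncodeError
rather than the OverflowError mentioned above.

```python
# Illustration only: modern Python 3, hypothetical names.
raw = bytes(range(256))                   # every possible 8-bit value

# "Any 8-bit character fits into a 16-bit character": keep the ordinal.
widened = "".join(chr(b) for b in raw)

# Explicitly decoding the same bytes as Latin-1.
decoded = raw.decode("latin-1")

same_result = (widened == decoded)        # both approaches agree


def narrow(text):
    """Convert text back to 8-bit data; fails if a character is > 255."""
    return text.encode("latin-1")
```

Calling narrow("\u20ac") (the euro sign) raises UnicodeEncodeError, because
that character has no Latin-1 representation.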

> > > > > why keep on pretending that strings and strings are two
> > > > > different things?  it's an artificial distinction, and it only
> > > > > causes problems all over the place.
> > > >
> > > > Sure. The point is that we can't just drop the old 8-bit
> > > > strings... not until Py3K at least (and as Fred already
> > > > said, all standard editors will have native Unicode support
> > > > by then).
> > >
> > > I discussed that in my original "all characters are unicode
> > > characters" proposal.  in my proposal, the standard string
> > > type will have two roles: a string either contains unicode
> > > characters, or binary bytes.
> > >
> > > -- if it contains unicode characters, python guarantees that
> > > methods like strip, lower (etc), and regular expressions work
> > > as expected.
> > >
> > > -- if it contains binary data, you can still use indexing, slicing,
> > > find, split, etc.  but they then work on bytes, not on chars.
> > >
> > > it's still up to the programmer to keep track of what a certain
> > > string object is (a real string, a chunk of binary data, an en-
> > > coded string, a jpeg image, etc).  if the programmer wants
> > > to convert between a unicode string and an external encoding
> > > to use a certain unicode encoding, she needs to spell it out.
> > > the codecs are never called "under the hood".
> > >
> > > (note that if you encode a unicode string into some other
> > > encoding, the result is a binary buffer.  operations like strip,
> > > lower et al do *not* work on encoded strings).
> >
> > Huh ? If the programmer already knows that a certain
> > string uses a certain encoding, then he can just as well
> > convert it to Unicode by hand using the right encoding
> > name.
> 
> I thought that was what I said, but the text was garbled.  let's
> try again:
> 
>     if the programmer wants to convert between a unicode
>     string and a buffer containing encoded text, she needs
>     to spell it out.  the codecs are never called "under the
>     hood"

Again and again...

The original intent of the Unicode integration was to make
Unicode and 8-bit strings interoperate without too much user
intervention. This comes at a cost (the UTF-8 encoding), but
if you do use this encoding (and this is not far-fetched,
since there are input sources which do return UTF-8, e.g.
Tcl), the Unicode implementation will apply all its knowledge
to give you the results you expect.

If you don't like this, you can always apply explicit
conversion calls wherever needed. Latin-1 and UTF-8
are not compatible: converting Latin-1 data as if it were
UTF-8 is very likely to cause an exception, so the user
will indeed be informed about this failure.
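
As a rough illustration of that failure mode, the sketch below uses
present-day Python 3 (not the 1.6 implementation; try_utf8 is a
hypothetical helper). It decodes Latin-1 data as UTF-8 and catches the
resulting error. It also shows why "very likely" is the right wording: some
byte sequences happen to be valid in both encodings and decode silently to
different text.

```python
# Illustration only: modern Python 3, hypothetical helper name.
latin1_bytes = "caf\u00e9".encode("latin-1")      # b'caf\xe9'


def try_utf8(data):
    """Decode data as UTF-8, returning None if it is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


failed = try_utf8(latin1_bytes)       # 0xE9 starts an incomplete sequence
plain_ascii = try_utf8(b"cafe")       # pure ASCII is valid in both encodings

# A sequence that is valid in both encodings but means different things:
ambiguous = b"\xc3\xa9"
as_utf8 = ambiguous.decode("utf-8")       # one character, U+00E9
as_latin1 = ambiguous.decode("latin-1")   # two characters, U+00C3 U+00A9
```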
 
> > The whole point we are talking about here is that when
> > having the implementation convert a string to Unicode all
> > by itself it needs to know which encoding to use. This is
> > where we have decided long ago that UTF-8 should be
> > used.
> 
> does "long ago" mean that the decision cannot be
> questioned?  what's going on here?
> 
> face it, I don't want to guess when and how the interpreter
> will convert strings for me.  after all, this is Python, not Perl.
> 
> if I want to convert from a "string of characters" to a byte
> buffer using a certain character encoding, let's make that
> explicit.

Hey, there's nothing which prevents you from doing so
explicitly.
 
> Python doesn't convert between other data types for me, so
> why should strings be a special case?

Sure it does: 1.5 + 2 == 3.5, 2L + 3 == 5L, etc...
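
For reference, the numeric half of this can still be checked in
present-day Python 3, with the caveat that the 2L literal no longer exists
there (int and long were later unified). The sketch below also shows, as a
later design outcome and not something decided in this thread, that Python 3
refuses to mix text and byte strings implicitly. The helper name concat is
made up for illustration.

```python
# Illustration only: modern Python 3, hypothetical helper name.
mixed_sum = 1.5 + 2                       # the int is coerced to float
is_float = isinstance(mixed_sum, float)


def concat(text, data):
    """Try to add a text string and a byte string; None if refused."""
    try:
        return text + data
    except TypeError:
        return None


refused = concat("abc", b"def")           # str + bytes raises TypeError
```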
 
> > The pragma discussion is about a totally different
> > issue: pragmas could make it possible for the programmer
> > to tell the *compiler* which encoding to use for literal
> > u"unicode" strings -- nothing more. Since "8-bit" strings
> > currently don't have an encoding attached to them we store
> > them as-is.
> 
> what do I have to do to make you read my proposal?
> 
> shout?
> 
> okay, I'll try:
> 
>     THERE SHOULD BE JUST ONE INTERNAL CHARACTER
>     SET IN PYTHON 1.6: UNICODE.

Please don't shout... simply read on...

Note that you are again arguing for using Latin-1 as
default encoding -- why don't you simply make this fact
explicit ?
 
> for consistency, let this be true for both 8-bit and 16-bit
> strings (as well as Py3K's 31-bit strings ;-).
> 
> there are many possible external string encodings, just like there
> are many possible external integer encodings.   but for integers,
> that's not something that the core implementation cares much
> about.  why are strings different?
> 
> > I don't want to get into designing a completely new
> > character container type here... this can all be done for Py3K,
> > but not now -- it breaks things at too many ends (even though
> > it would solve the issues with strings being used in different
> > contexts).
> 
> you don't need to -- you only need to define how the *existing*
> string type should be used.  in my proposal, it can be used in two
> ways:
> 
> -- as a string of unicode characters (restricted to the
>    0-255 subset, for obvious reasons).  given a string 's',
>    len(s) is always the number of characters, s[i] is the
>    i'th character, etc.
> 
> or
> 
> -- as a buffer containing binary bytes. given a buffer 'b',
>    len(b) is always the number of bytes, b[i] is the i'th
>    byte, etc.
> 
> this is one flavour less than in the 1.6 alphas -- where strings sometimes
> contain UTF-8 (and methods like upper etc don't work), sometimes an
> 8-bit character set (and upper works), and sometimes binary buffers (for
> which upper doesn't work).
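
The character/byte distinction drawn in the quoted proposal can be sketched
as follows. This uses present-day Python 3 str and bytes as stand-ins for
the two proposed roles, which is an assumption made for illustration and
not the 1.6 string type; all variable names are hypothetical.

```python
# Illustration only: Python 3 str/bytes standing in for the two roles.
text = "na\u00efve"                  # five characters, one of them non-ASCII
encoded = text.encode("utf-8")       # the same text as a UTF-8 byte buffer

char_count = len(text)               # counts characters
byte_count = len(encoded)            # counts bytes; the i-diaeresis takes two

third_char = text[2]                 # indexing text yields a character
third_byte = encoded[2]              # indexing bytes yields an integer byte
```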

Strings always contain data -- there's no encoding attached
to them. If the user calls .upper() on a binary string the
output will most probably no longer be usable... but that's
the programmer's fault, not the string type's fault.
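
A small sketch of what that looks like, using bytes.upper() from
present-day Python 3 as a stand-in for the 1.6 string method (an assumption
for illustration; the variable names are made up). Byte values that happen
to match ASCII lowercase letters are rewritten, so the binary data is
silently altered without any error being raised.

```python
# Illustration only: modern Python 3 bytes, hypothetical names.
binary = bytes([0x61, 0x00, 0x7A, 0xFF])   # arbitrary data, not text

# upper() only knows about ASCII letters: 0x61 and 0x7A get changed,
# while 0x00 and 0xFF pass through untouched.
changed = binary.upper()

was_altered = (changed != binary)
same_length = (len(changed) == len(binary))
```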
 
> (hmm.  I've said all this before, haven't I?)

You know as well as I do that the existing string type
is used for both binary and text data. You cannot simply change
this by introducing some new definition of what should
be stored in buffers and what in strings... not until we
have officially redefined these things, say in Py3K ;-)

> frankly, I'm beginning to feel like John Skaller.  do I have to write my
> own interpreter to get this done right? :-(

No, but you should have started this discussion in late
November last year... not now, when everything has already
been implemented and people are starting to use the
code that's there with great success.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/