[Python-Dev] #pragmas in Python source code

Fredrik Lundh <effbot@telia.com>
Thu, 13 Apr 2000 18:52:44 +0200


M.-A. Lemburg wrote:
> Fredrik Lundh wrote:
> >
> > M.-A. Lemburg wrote:
> > > The current need for #pragmas is really very simple: to tell
> > > the compiler which encoding to assume for the characters
> > > in u"...strings..." (*not* "...8-bit strings...").
> >
> > why not?
>
> Because plain old 8-bit strings should work just as before,
> that is, existing scripts only using 8-bit strings should not break.

but they won't -- if you don't use an encoding directive, and
don't use 8-bit characters in your string literals, everything
works as before.

(that's why the default is "none" and not "utf-8")

if you use 8-bit characters in your source code and wish to
add an encoding directive, it has to be the right one for the
encoding you're actually using...
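
(in codec terms, the directive just tells the compiler which
decoder to run the literals through.  a sketch, with the file
contents and the encoding name made up:

    # what the compiler does with u"blåbär" under an
    # iso-8859-1 directive: decode the raw literal bytes,
    # once, at compile time
    raw = "bl\xe5b\xe4r"               # the bytes in the file
    text = unicode(raw, "iso-8859-1")  # the resulting string

no directive, no decoding.)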

> > why keep on pretending that strings and strings are two
> > different things?  it's an artificial distinction, and it only
> > causes problems all over the place.
>
> Sure. The point is that we can't just drop the old 8-bit
> strings... not until Py3K at least (and as Fred already
> said, all standard editors will have native Unicode support
> by then).

I discussed that in my original "all characters are unicode
characters" proposal.  in my proposal, the standard string
type will have two roles: a string either contains unicode
characters, or binary bytes.

-- if it contains unicode characters, python guarantees that
methods like strip, lower (etc), and regular expressions work
as expected.

-- if it contains binary data, you can still use indexing, slicing,
find, split, etc.  but they then work on bytes, not on chars.

it's still up to the programmer to keep track of what a certain
string object is (a real string, a chunk of binary data, an
encoded string, a jpeg image, etc).  if the programmer wants
to convert between a unicode string and an external encoding,
she needs to spell it out.  the codecs are never called
"under the hood".

(note that if you encode a unicode string into some other
encoding, the result is a binary buffer.  operations like strip,
lower et al do *not* work on encoded strings).
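
spelled out in code (method names as in the current 1.6 alphas):

    text = u"bl\xe5b\xe4r"           # unicode characters
    data = text.encode("utf-8")      # explicit: now a binary buffer
    data.upper()                     # bytewise: "\xc3\xa5" is left
                                     # alone, so this is *not* the
                                     # upper-case version of text
    back = unicode(data, "utf-8")    # explicit conversion back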

> So for now we're stuck with Unicode *and* 8-bit strings
> and have to make the two meet somehow -- which isn't all
> that easy, since 8-bit strings carry no encoding information.

in my proposal, both string types hold unicode strings.  they
don't need to carry any encoding information, because they're
not encoded.

> > > Could be that we don't need this pragma discussion at all
> > > if there is a different, more elegant solution to this...
> >
> > here's one way:
> >
> > 1. standardize on *unicode* as the internal character set.  use
> > an encoding marker to specify what *external* encoding you're
> > using for the *entire* source file.  output from the tokenizer is
> > a stream of *unicode* strings.
>
> Yep, that would work in Py3K...

or 1.7 -- see below.

> > 2. if the user tries to store a unicode character larger than 255
> > in an 8-bit string, raise an OverflowError.
>
> There are no 8-bit strings in Py3K -- only 8-bit data
> buffers which don't have string methods ;-)

oh, you've seen the Py3K specification?
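
(to spell out item (2), here's roughly what the failure mode
looks like, with encode standing in for the store.  the exception
name is the proposal's; today's codecs raise UnicodeError:

    s = u"\u20ac"             # EURO SIGN; ordinal > 255
    t = s.encode("latin-1")   # the character doesn't fit in a
                              # byte: UnicodeError today, and
                              # OverflowError under the proposal

)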

> > 3. the default encoding is "none" (instead of XML's "utf-8"). in
> > this case, treat the script as an ascii superset, and store each
> > string literal as is (character-wise, not byte-wise).
>
> Uhm. I think UTF-8 will be the standard for text file formats
> by then... so why not make it UTF-8 ?

in time for 1.6?

or you mean Py3K?  sure!  I said that in my first "additional note",
didn't I:

> > additional notes:
> >
> > -- item (3) is for backwards compatibility only.  might be okay to
> > change this in Py3K, but not before that.
> >
> > -- leave the implementation of (1) to 1.7.  for now, assume that
> > scripts have the default encoding, which means that (2) cannot
> > happen.
>
> I'd say, leave all this to Py3K.

do you mean it's okay to settle for a broken design in 1.6,
since we can fix it in Py3K?  that's scary.

fixing the design is not that hard, and can be done now.

implementing all parts of it is harder, and requires extensive
changes to the compiler/interpreter architecture.  but iirc,
such changes are already planned for 1.7...

> > -- we still need an encoding marker for ascii supersets (how about
> > <?python encoding="utf-8" version="1.6"?> ;-).  however, it's up to
> > the tokenizer to detect that one, not the parser.  the parser only
> > sees unicode strings.
>
> Hmm, the tokenizer doesn't do any string -> object conversion.
> That's a task done by the parser.

"unicode string" meant Py_UNICODE*, not PyUnicodeObject.

whether the tokenizer does the actual conversion doesn't really matter;
the point is that once the code has passed through the tokenizer,
it's unicode.
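
here's the whole pipeline in miniature (file name made up; the
unicode() call stands in for what the tokenizer would do in C):

    raw = open("script.py", "rb").read()  # bytes, in some encoding
    source = unicode(raw, "utf-8")        # one decode, up front
    # from here on, the parser and compiler only see unicode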

</F>