[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Bengt Richter bokr at oz.net
Sun Oct 23 01:49:53 CEST 2005


Please bear with me for a few paragraphs ;-)

One aspect of str-type strings is the efficiency afforded when all the text really
is ascii. If the internal encoding for strings were e.g. a fixed utf-16le, maybe with today's
computers it would still be efficient enough for most actual string purposes (excluding
the current use of str-strings as byte sequences).

I.e., you'd still have to identify what was "strings" (of characters) and what was really
byte sequences with no implied or explicit encoding or character semantics.

Ok, let's make that distinction explicit: Call one kind of string a byte sequence and the
other a character sequence (representation being a separate issue).

A unicode object is of course the prime _general_ representation of a character sequence
in Python, but all the names in Python source code (that become NAME tokens) are UIAM
also character sequences, and representable by a byte sequence interpreted according to
ascii encoding.

For the sake of discussion, suppose we had another _character_ sequence type that was
the moral equivalent of unicode except for internal representation, namely a str
subclass with an encoding attribute specifying the encoding that you _could_ use
to decode the str bytes part to get unicode (which you wouldn't do except when necessary).
We could call it class charstr(str): ... and have charstr().bytes be the str part and
charstr().encoding specify the encoding part.
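
For concreteness, a minimal sketch of what such a charstr might look like (illustrative
only -- the 'ascii' default and the simplified .decode() are arbitrary choices here, not
part of the proposal):

    class charstr(str):
        def __new__(cls, bytes, encoding='ascii'):
            self = str.__new__(cls, bytes)
            self.encoding = encoding      # remember how the bytes are encoded
            return self
        @property
        def bytes(self):
            return str(self)              # the raw byte sequence, as a plain str
        def decode(self, encoding=None, errors='strict'):
            # use the carried encoding unless one is given explicitly
            return str.decode(self, encoding or self.encoding, errors)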

In all the contexts where we have obvious encoding information, we can then generate
a charstr instead of a str. E.g., if the source of module_a has

    # -*- coding: latin1 -*-
    cs = 'über-cool'
then
    type(cs)  # => <type 'charstr'>
    cs.bytes  # => '\xfcber-cool'
    cs.encoding # => 'latin-1'

and print cs would act like print cs.bytes.decode(cs.encoding) -- or I guess
    sys.stdout.write(cs.bytes.decode(cs.encoding).encode(sys.stdout.encoding))
followed by
    sys.stdout.write('\n'.decode('ascii').encode(sys.stdout.encoding))
for the newline of the print.
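
Spelled out with today's str/unicode methods, that expansion would be roughly the following
(errors='replace' is my addition just so the demo can't blow up on an ascii-only terminal --
the proposal doesn't say what the error handling should be -- and sys.stdout.encoding is
None when output is piped, hence the fallback):

    import sys
    bytes = '\xfcber-cool'                    # what cs.bytes would hold
    encoding = 'latin-1'                      # what cs.encoding would hold
    out_enc = sys.stdout.encoding or 'ascii'
    sys.stdout.write(bytes.decode(encoding).encode(out_enc, 'replace'))
    sys.stdout.write('\n'.decode('ascii').encode(out_enc, 'replace'))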

Now if module_b has

    # -*- coding: utf8 -*-
    cs = 'über-cool'

and we interactively
    import module_a, module_b
and then
    print module_a.cs + ' =?= ' + module_b.cs

what could happen ideally vs. what we have currently?
UIAM, currently we would just get the three str byte sequences
concatenated to make
    '\xfcber-cool =?= \xc3\xbcber-cool'
and that would be printed, without conversion, as whatever those bytes
come out as when interpreted by the output device according to
sys.stdout.encoding.
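
A quick interactive check of that current behaviour:

 >>> a = '\xfcber-cool'        # module_a's literal as latin-1 bytes
 >>> b = '\xc3\xbcber-cool'    # module_b's literal as utf8 bytes
 >>> a + ' =?= ' + b
 '\xfcber-cool =?= \xc3\xbcber-cool'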

But if those cs instances had been charstr instances, the coding cookie
encoding information would have been preserved, and the interactive print could
have evaluated the string expression -- given cs.decode() as sugar for
    (cs.bytes.decode(cs.encoding or globals().get('__encoding__') or
         __import__('sys').getdefaultencoding()))
-- as

    module_a.cs.decode() + ' =?= '.decode() + module_b.cs.decode()

if pairwise terms differ in encoding, as they all might here. If the interactive
session source were e.g. latin-1, like module_a, then
    module_a.cs + ' =?= '
would not require an encoding change, because the ' =?= ' would be a charstr instance
with encoding == 'latin-1', and so the result would still be latin-1 that far.
But with module_b.cs being utf8, the next addition would cause the .decode() promotions
to unicode. In a console window, the ' =?= '.encoding might be 'cp437' or such, and
the first addition would then cause promotion (since module_a.cs.encoding != 'cp437').
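
In sketch form (building on the charstr sketch above; a real implementation would of course
live in charstr.__add__ rather than a free function):

    def charstr_add(left, right):
        if left.encoding == right.encoding:
            # same encoding: stay at the byte level and keep that encoding
            return charstr(left.bytes + right.bytes, left.encoding)
        # encodings differ: promote both sides to unicode
        return left.decode() + right.decode()

so that, in a latin-1 interactive session,

 >>> a = charstr('\xfcber-cool', 'latin-1')     # module_a.cs
 >>> sep = charstr(' =?= ', 'latin-1')          # the session's literal
 >>> b = charstr('\xc3\xbcber-cool', 'utf8')    # module_b.cs
 >>> charstr_add(charstr_add(a, sep), b)
 u'\xfcber-cool =?= \xfcber-cool'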

I have sneaked in run-time access to individual modules' encodings by assuming that
the encoding cookie could be compiled in as an explicit global __encoding__ variable
for any given module (what to have as __encoding__ for built-in modules could vary for
various purposes).

ISTM this could have use in situations where an encoding assumption is necessary and
currently 'ascii' is not as good a guess as one could make, though I suspect that if string
literals became charstr strings instead of str strings, many if not most of those situations
would disappear (I'm saying this because ATM I can't think of an 'ascii'-guess situation that
wouldn't go away ;-)). If there were a charchr() version of chr() that resulted in
a charstr instead of a str, IWT one would want an easy-sugar default encoding assumption,
probably based on the same assumption one would make for '%c' % num in a given module's source
-- which presumably would be '%c'.encoding, where '%c' assumes the encoding of the module
source, normally recorded in __encoding__. So charchr(n) would act like
chr(n).decode().encode(''.encoding) -- or, more reasonably, like charstr(chr(n)), which would be
short for
    charstr(chr(n), globals().get('__encoding__') or __import__('sys').getdefaultencoding())
or some efficient equivalent ;-)
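
Again in sketch form, building on the charstr above (with the caveat that globals() in a
plain helper function would be the defining module's namespace, whereas the sugar is meant
to pick up the *calling* module's __encoding__ -- which is why it would want to be
compiled-in sugar rather than an ordinary function):

    import sys

    def charchr(n, encoding=None):
        # __encoding__ is the proposed per-module global for the coding cookie;
        # it doesn't exist today, so this falls back to the default encoding.
        enc = (encoding
               or globals().get('__encoding__')
               or sys.getdefaultencoding())
        return charstr(chr(n), enc)

 >>> charchr(0xfc, 'latin-1').decode()
 u'\xfc'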

Using strings in dicts requires hashing to find key comparison candidates and comparison to
check for key equivalence. This would seem to point to some kind of normalized hashing, but
not necessarily normalized key representation. Some normalization is apparently happening already, since
 >>> hash('a') == hash(unicode('a'))
 True

I don't know what would be worth the trouble to optimize: string key usage where the strings are
really all of one encoding, vs. totally general use, vs. a heavily biased mix. Or even if it could
be done without unreasonable complexity. Maybe a dict could be given an option to hash all
its keys as unicode vs whatever it does now. But having a charstr subtype of str would improve
the "implicit" conversions to unicode IMO.
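
For instance, a charstr variant could normalize its hashing (though not its stored
representation) by hashing and comparing the decoded value -- one more sketch on top of the
charstr above, just to show the shape of the idea:

    class normcharstr(charstr):
        def __hash__(self):
            # hash by character value rather than byte value
            return hash(self.decode())
        def __eq__(self, other):
            if isinstance(other, charstr):
                other = other.decode()
            if isinstance(other, unicode):
                return self.decode() == other
            return NotImplemented
        def __ne__(self, other):
            eq = self.__eq__(other)
            if eq is NotImplemented:
                return eq
            return not eq

so that keys spelling the same characters in different encodings land in the same slot:

 >>> d = {normcharstr('\xfcber-cool', 'latin-1'): 1}
 >>> d[normcharstr('\xc3\xbcber-cool', 'utf8')]
 1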

Anyway, I wanted to throw in my .02USD re the implicit conversions, taking the view that
much of the implicitness could be based on reliable inferences from source encodings of
string literals or from their effects as format strings.

Regards,
Bengt Richter
[not a normal subscriber to python-dev, so I'll have to google for any responses]


