Multibyte Character Surport for Python

Tue May 14 15:09:17 EDT 2002

On 14 May 2002 10:13:24 +0200, loewis at informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=) wrote:
>
>Python code can contain non-ASCII in byte strings literals, Unicode
>string literals, and comments. For recoding, all of those places need
>to be recoded, or else no editor in the world will be able to display
>the file correctly.
>
>From the above, a Python source seems to be what might be called a multi-encoded file.
ISTM a grammar defining the composition of a multi-encoded file would
make things a lot clearer. I know this doesn't represent current practice
(I'm just trying to prime the pump ;-) but, what if we had a grammar something like, e.g.,:

    multi_encoded_file: encoded_string_packet* [bom_headed_unicode_string] ENDMARKER
    encoded_string_packet: encoded_string_packet_header string_body
    encoded_string_packet_header: '<' encoding_id ',' packet_body_length '>'
    encoding_id: 'byte' | 'ascii' | 'unicode' | 'Latin-1' | etc...
    packet_body_length: digit+
    string_body: byte* | bom_headed_unicode_string
    bom_headed_unicode_string: #<a byte sequence with any supported encoding and endianness>
    etc...

The above effectively specifies a file as a raw byte sequence that can be interpreted
as variously encoded strings. This raw byte string file can have alternate various
representations for display and editing purposes. A smart editor could deal with the
various pieces like winword treats embedded graphics and tables, etc., but that is
another discussion. For the moment, I'd just like to elicit a grammar representing
current practice for Python.

This grammar is, or IMO should be, totally orthogonal to what the purpose of the text is,
whether Python source or a Chinese novel. Of course a particular use will imply constraints
on instance structure, but the grammar per se should not be affected.

For the case of Python, it would help me (and probably others) if you would you sketch a version
of the above that reflects current design in the various representations of text used by Python.
I.e., source file, and in-memory string representations, etc.

I think it is good to remember that a Python program is (or at least I consider it as such)
an abstract entity first and variously represented second. Abstract token sequences and
visible glyph sequences and binary coded representations all have roles, but it is easy
to smear the distinctions when thinking about them. Localization should IMO not alter
abstract semantics. The possibility of dynamically generating source text and eval- or
exec-ing it is something to consider too.

Regards,
Bengt Richter