PEP263 (Specifying encoding) and bytecode strings
Mike C. Fletcher
mcfletch at rogers.com
Mon May 5 16:28:24 EDT 2003
Terry Reedy wrote:
>I am a little puzzled by some of the questions and comments in this
>thread. Am I missing something?
>
Probably the purpose of resource-package :) , namely automatically
embedding sets of binary resources in Python source-code files.
>"Tony Meyer" <ta-meyer at ihug.co.nz> wrote in message
>news:mailman.1052120948.3150.python-list at python.org...
>
>>>>Is there some way to specify that all strings are
>>>>bytecodes, and not encoded characters?
>
>The value of a Python string object *is* a sequence of bytecodes.
>Character encoding is in the eye of the interpreter/user of a string.
>
Sure, until someone (with the best of intentions) decides some day that
all strings in the interpreter are Unicode (such things happen, and I'm
pretty sure I've heard rumblings from very deep in the hierarchy along
this path), and there will be a separate "buffer" type for byte-streams.
When/if that happens, your source-file says that your binary data is
latin-1-encoded Unicode data, which makes the binary data gibberish when
the Unicode hits the fan. In essence, by declaring the data as
"latin-1", you're encoding garbage in the file so that future versions
of Python won't be able to recognise that the data is actually a
byte-stream.
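
To make the worry concrete, here is a minimal sketch of what the latin-1 declaration does to embedded binary data. It assumes an interpreter in which strings are Unicode and bytes are a separate type (the sketch uses Python 3 syntax, which is an assumption relative to this thread), and the payload and the `decompresses` helper are made up for illustration.

```python
import zlib

# Illustrative binary payload of the kind that would be embedded in a source file.
payload = zlib.compress(b"example resource contents" * 10)

def decompresses(data):
    """Hypothetical helper: True if zlib accepts the data, False otherwise."""
    try:
        zlib.decompress(data)
        return True
    except zlib.error:
        return False

# latin-1 maps every byte value to the code point with the same number,
# so decoding and re-encoding with latin-1 is lossless.
as_text = payload.decode("latin-1")
assert as_text.encode("latin-1") == payload

# The decoded value is typed as text, though, and nothing in it records
# that it was never character data in the first place.
print(type(payload).__name__, type(as_text).__name__)

# Anything that later treats it as real text and writes it out in another
# encoding produces different bytes, which zlib rejects.
mangled = as_text.encode("utf-8")
print(decompresses(payload), decompresses(mangled))
```

The round trip only works as long as every tool along the way knows to use latin-1 again, which is exactly the out-of-band knowledge the declaration fails to express.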
What I'd like from PEP 263 is a way to make the declaration "this file
has no encoding" or "these strings are byte-sequences, *not* Unicode
data encoded with some particular encoding". Using a particular 1-byte
encoding is fine for now, but you're encoding erroneous information in
the file, which is not the best design practice. Given that PEP 263
already requires one rewrite of all old software to support it, making
a second rewrite necessary some day (to change that declaration again)
seems somewhat... intrusive... especially if the goal is to maintain
customer confidence in Python's stability.
>>I probably phrased my question poorly: what, then, is the correct
>>encoding for the output of zlib.compress()? I know IANA has a list [1]
>>of encodings, but it's not really clear which is the right one.
>
>I think you are asking for *the* 'correct' fake declaration. If there
>is not yet a way to say encoding = None or encoding = bytes, then any
>one that works should be ok.
>
Except for the GIGO principle, sure ;) . PEP 263 just has this annoying
habit of violating every aesthetic sense I have :) , from the inclusion
of semantics in comments, to breaking old code, to requiring that all
strings be converted to Unicode and back again during the parsing
process, to the inability to specify a NULL/RAW encoding. I know it's a
messy problem, but eek what a solution :) !
>Imitating the Python interpretation of source code so as to see '\xXX'
>as one byte rather than a quoted string of four Ascii chars, correct.
>So? This should only matter if you are putting compress() output or
>decompress() input into source code, such as for testing each function
>separately.
>
Which is exactly what resource-package does (though for portable access
to the files during package embedding/deployment, not necessarily
testing). For now I guess we'll use the latin-1 hack, but honestly,
that kind of kludge is not something I'm happy about including in lots
of people's files: every user of resource-package will likely need to
run an upgrade script *on every embedded resource file* at some point in
the future, which is not a great joy to me. The alternative, escaping
the high bytes as '\xXX' (a ~2x size explosion, since 1/2 of bytes become
4 bytes), really isn't that much better, and still requires rewrites
when said Unicode hits said fan.
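
As a rough check on that size estimate, the sketch below counts the characters needed when every byte at or above 0x80 is written as a four-character '\xXX' escape and every other byte is written as itself. It assumes byte values are uniformly distributed, which compressed data roughly is; `escaped_size` and `sample` are hypothetical names used only for this illustration.

```python
def escaped_size(data):
    # Count one character for each byte below 0x80 and four characters
    # (backslash, 'x', two hex digits) for each byte at or above 0x80.
    return sum(4 if b >= 0x80 else 1 for b in data)

# Every byte value appears equally often, standing in for compressed output.
sample = bytes(range(256)) * 4

ratio = escaped_size(sample) / len(sample)
print(ratio)
```

Under these assumptions the ratio comes out at 2.5, in the same ballpark as the figure quoted above. The count is a lower bound, since quotes, backslashes, newlines and other control bytes below 0x80 would also need escaping in a real string literal.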
Sigh,
Mike
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/