[Python-3000] Unicode and OS strings

Fri Sep 14 10:56:24 CEST 2007

Greg Ewing writes:

 > Stephen J. Turnbull wrote:
 > > You can't win that, because Unicode is the only encoding that attempts
 > > to guarantee even the possibility of round-tripping.
 > 
 > Rubbish -- I can do print [ord(c) for c in my_unicode_string]
 > and get perfect round-trippability if I want.

Speaking of rubbish.  You chose the context of round-tripping *across
encodings*, not me.  Please stick with your context.
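
To make the distinction concrete, here is a minimal sketch in Python 3 syntax. The sample strings and variable names are illustrative only. Listing code points round-trips a string that is already Unicode; round-tripping across encodings means legacy bytes survive a pass through Unicode, which only works if the source encoding is known.

```python
# Round trip inside Unicode: string -> code points -> string.
text = "na\u00efve \u0436"
code_points = [ord(c) for c in text]
rebuilt = "".join(chr(n) for n in code_points)

# Round trip across encodings: KOI8-R bytes -> Unicode -> KOI8-R bytes.
legacy_bytes = "\u0436".encode("koi8_r")
via_unicode = legacy_bytes.decode("koi8_r").encode("koi8_r")

# The same bytes decoded under a wrong guess give a different character,
# so the code-point list of that result says nothing about the original.
wrong_guess = legacy_bytes.decode("latin-1")
```

The first round trip never leaves Unicode, so it cannot show whether the original bytes were interpreted correctly.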

 > You can ask people to use pre-existing officially-sanctioned
 > encodings for their unicode data, but you can't force them to.

A wide variety of encodings, some standard and some not, and not
necessarily with a known injection into Unicode, is precisely what I'm
trying to deal with.  None of the other proposals, except maybe
Martin's, do.  James Knight's proposal as it stands assumes UTF-8
Unicode, while Marcin Kowalczyk's just punts to treating everything
unknown as a sequence of code units AFAICS.

 > > The main problem with this scheme that I know of is that if you have a
 > > Python string that contains such a code point, you'll need to somehow
 > > include the information about the original encoding when pickling and
 > > the like.

I was merely admitting that getting it to work *efficiently* and
*backward-compatibly* for pickling will be tricky.  But it's trivial
to get it to work *reliably*.

 > That's exactly the sort of thing I'm talking about. It
 > would be surprising if pickling worked reliably for all
 > strings *except* ones that happened to come in as a
 > command line argument.

Um, no, it's not what you're talking about.  Pickling is not currently
reliable for strings that come in as command line arguments because
Python's decoding of those arguments is not reliable.  That's precisely
what we're trying to fix.
None of the proposals make things worse, since they only apply in
cases where the codec would throw an exception or incorrectly decode
the argument anyway.
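
The two failure cases can be sketched as follows, assuming Python 3 syntax and made-up sample arguments (the names `latin1_arg` and `utf8_arg` are hypothetical): a strict codec either raises on the bytes or silently decodes them to the wrong text.

```python
# Case 1: the codec throws. Latin-1 bytes are not valid UTF-8.
latin1_arg = "caf\u00e9".encode("latin-1")      # b'caf\xe9'
try:
    latin1_arg.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Case 2: the codec decodes incorrectly. UTF-8 bytes read as Latin-1
# never raise, because every byte is a valid Latin-1 character.
utf8_arg = "caf\u00e9".encode("utf-8")          # b'caf\xc3\xa9'
misdecoded = utf8_arg.decode("latin-1")         # two wrong characters, no error
```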

Yes, you could improve reliability in this sense by storing those
strings as bytes, rather than trying to make better encoding guesses
and storing "debugging info" about undecodable input.  But surely
using bytes objects is a non-starter; users are going to expect that
command-line arguments are strings, not bytes, and ASCII-only users
will raise hell if you ask them to explicitly invoke codecs to
translate command-line arguments to strings so that they can be used.
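
As a rough illustration of that burden, assuming Python 3 syntax and a hypothetical `raw_args` list standing in for a bytes-valued argument vector, even a pure-ASCII user would have to write the decoding step by hand:

```python
# Hypothetical: command-line arguments delivered as bytes objects.
raw_args = [b"--output", b"report.txt"]

# The explicit codec call every caller would need before using the
# arguments as ordinary strings.
args = [a.decode("ascii") for a in raw_args]

output_name = args[1]
```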