[Python-Dev] Python-3.0, unicode, and os.environ

Ulrich Eckhardt eckhardt at satorlaser.com
Wed Dec 10 11:39:37 CET 2008

On Tuesday 09 December 2008, Adam Olsen wrote:
> On Tue, Dec 9, 2008 at 11:31 AM, Ulrich Eckhardt
> <eckhardt at satorlaser.com> wrote:
> > On Monday 08 December 2008, Adam Olsen wrote:
> >> At this point someone suggests we have a type that can store an
> >> arbitrary mix of unicode and bytes, so the undecodable portions stay
> >> in their original form. :P
> >
> > Well, not an arbitrary mix, but a type that just stores whatever comes
> > from the system without further specifying it as either bytes or Unicode:
> >
> > * If you want a string for displaying it, you first have to extract a
> > string from that thing and there you optionally specify the encoding and
> > error behaviour.
> > * If you want to append a string to it, it is automatically encoded in
> > the default encoding, which obviously can fail.
> So the 2.x str, but with a more interesting default encoding than
> ASCII.  It'll work fine on the developer's system, but one day a user
> will present it with strange input, and boom.

If the system's representation of filenames cannot represent a Unicode 
codepoint that the user entered, trying to open such a file must fail. If it 
can be represented, for convenience I would allow an implicit conversion:

  for i in readdir():
      copy(i, i + ".backup")

> You have to be pessimistic here.  The default operations should either
> always work or never work.  Using unicode internally and skipping
> garbage input means the operations always work.  Using a bytes API
> means mixing with unicode never works, unless the programmer
> explicitly converts, in which case the onus is on them to use proper
> error handling.

So, if I understand you correctly, you would prefer an explicit conversion to 
the system's representation:

  for i in readdir():
      copy(i, i + path(".backup"))

> The only thing separating this from a bikeshed discussion is that a
> bikeshed has many equally good solutions, while we have no good
> solutions.  Instead we're trying to find the least-bad one.  The
> unicode/bytes separation is pretty close to that.  Adding a warning
> gets even closer.  Adding magic makes it worse.

Well, I see two cases:
1. Converting from an uncertain representation to a known one.
2. Converting from a known representation to a known one.

The uncertain one is the one used by the filesystem or environment. The known 
representations are the expected(!) encoding for filesystem and environment 
and the internal text in Unicode. For case 1, I would require an explicit 
conversion to make the programmer really aware of the fact that it can fail. 
For the second case, I would allow an implicit conversion even though it can 
fail. Anyhow, that is a matter of taste, and I can actually live with your 
point of view.
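
To make the two cases concrete, here is a minimal sketch of such a type. The
class name SystemString and its methods are hypothetical and exist only for
illustration; nothing like this is in the standard library. Case 1 is the
explicit decode() call, case 2 is the implicit encoding done when a str is
appended.

```python
import sys


class SystemString:
    """Hypothetical wrapper for text whose representation is uncertain,
    e.g. a filename or an environment value as delivered by the system."""

    def __init__(self, raw, encoding=None):
        if not isinstance(raw, bytes):
            raise TypeError("raw must be bytes")
        self._raw = raw
        # The *expected* encoding; the raw data may not actually conform to it.
        self._encoding = encoding or sys.getfilesystemencoding()

    @property
    def raw(self):
        return self._raw

    def decode(self, encoding=None, errors="strict"):
        # Case 1: uncertain -> known. Always explicit, because it can fail
        # (UnicodeDecodeError) unless a lenient error handler is chosen.
        return self._raw.decode(encoding or self._encoding, errors)

    def __add__(self, other):
        if isinstance(other, SystemString):
            return SystemString(self._raw + other._raw, self._encoding)
        if isinstance(other, str):
            # Case 2: known -> known. Implicit, but it can still fail
            # (UnicodeEncodeError) if the text is not representable.
            encoded = other.encode(self._encoding)
            return SystemString(self._raw + encoded, self._encoding)
        return NotImplemented

    def __repr__(self):
        return "SystemString(%r, encoding=%r)" % (self._raw, self._encoding)


name = SystemString(b"caf\xe9.txt", encoding="latin-1")
backup = name + ".backup"          # implicit conversion of the suffix
print(backup)
print(name.decode())               # explicit conversion for display
```

Whether the implicit branch in __add__ should exist at all is exactly the
matter of taste discussed above; removing the str branch gives the stricter
variant.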

However, one question still remains: What about the approach in general, i.e. 
that these texts with an uncertain representation are handled as a separate 
type? I find this much more appealing than duplicating APIs like readdir() 
using either overloading on the arguments or a separate readdirb().
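
For comparison, the "overloading on the arguments" variant can be illustrated
with os.listdir(), which returns str names for a str path and bytes names for
a bytes path. The helper list_both_ways() is a hypothetical name used only
for this sketch, and os.fsencode() assumes Python 3.2 or later.

```python
import os
import tempfile


def list_both_ways(directory):
    """Return the directory listing as (text names, byte names)."""
    as_text = sorted(os.listdir(directory))                 # str in, str out
    as_bytes = sorted(os.listdir(os.fsencode(directory)))   # bytes in, bytes out
    return as_text, as_bytes


with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a plain ASCII name so both views are comparable.
    open(os.path.join(tmp, "example.txt"), "w").close()
    text_names, byte_names = list_both_ways(tmp)
    print(text_names, byte_names)
```

The caller has to pick one of the two views up front, which is the
duplication a separate type would avoid.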


Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932


