[issue4006] os.getenv silently discards env variables with non-UTF-8 values

Toshio Kuratomi report at bugs.python.org
Mon Nov 24 17:49:17 CET 2008


Toshio Kuratomi <a.badger at gmail.com> added the comment:

> The bug tracker is maybe not the right place to discuss a new Python3
> feature.

It's a bug!  But if you guys want it to be a feature, then what mailing
list do I need to join?  Is there one devoted to Unicode or is
python-dev where I need to go?

>> 1) return mixed unicode and byte types in os.environ
>One goal of Python3 was to avoid mixing bytes and characters (bytes/str).

As stated in my evaluation of the four options, +1 to this: option #1 takes
us back to the problems encountered in python-2.

>> 2) return only byte types in os.environ
> os.environ contains text (characters) and so should be decoded as unicode.

This is correct but is not accurate :-)  os.environ, the python variable,
contains only unicode because that's the way it's coded.  However, the Unix
environment which os.environ attempts to give access to contains bytes which
are almost always representable as characters.  The two caveats are:

1) There's nothing that constrains it to characters -- putting byte sequences
   that do not include null in the environment is valid.

2) The characters in the environment may be mixed encodings, sometimes due to
   things outside of the user's control.
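
Both caveats can be seen from the bytes alone. The following is a minimal
sketch; the variable names and values are made up for illustration, and UTF-8
is assumed as the system encoding:

```python
# Hypothetical raw environment values (illustrative only).
raw_values = {
    b'GOOD': 't\u00east'.encode('utf-8'),      # 'têst' as valid UTF-8
    b'LATIN1': 't\u00east'.encode('latin-1'),  # same text, different encoding
    b'BINARY': b'\xff\xff\xff\xff',            # null-free bytes that are not text
}

def decodable(value, encoding='utf-8'):
    """Return True if value can be decoded with the given encoding."""
    try:
        value.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

# Map each variable name to whether its value survives decoding.
results = {name: decodable(value) for name, value in raw_values.items()}
```

Only the first value decodes; the other two are legal environment contents
that a unicode-only os.environ has no way to represent.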

>> 3) raise an exception if someone attempts to access an environment
>> variable that cannot be decoded to unicode via the system encoding and
>> allow the value to be accessed as a byte string via another method.
>> 4) silently ignore the non-decodable variables when accessing os.environ
>> the normal way but have another method of accessing it that returns all
>> values as byte strings.
>
> Why not for (3).

Do you mean, "I support 3"?  Or did you not finish a thought here?

> But what would be the "another method" (4) to access byte 
> string? The problem of having two methods is that you need consistent 
> objects.

This is exactly the problem I was talking about in my analysis of #4 in the
previous comment.  This problem plagues the new os.listdir() behavior as well:
it introduces a construct (os.listdir('.')) that doesn't give all the
information but also doesn't warn the programmer when information is not
being shown.

> Imagine that you have os.environ (unicode) and os.environb (bytes).
> 
> Example 1:
>   os.environb['PATH'] = b'\xff\xff\xff\xff'
> What is the value in os.environ['PATH']?

Since option 4 mimics the os.listdir() behavior, accessing os.environ['PATH']
would give you a KeyError, i.e., the value was silently dropped just as
os.listdir('.') does.
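
A rough sketch of that option 4 behavior, using a plain dict to stand in for
the raw environment. The function name is hypothetical, and UTF-8 is assumed
as the system encoding:

```python
def decode_environ_dropping(raw_env, encoding='utf-8'):
    """Option 4 sketch: keep only the variables that decode cleanly."""
    decoded = {}
    for name, value in raw_env.items():
        try:
            decoded[name.decode(encoding)] = value.decode(encoding)
        except UnicodeDecodeError:
            continue  # silently dropped; the caller gets no warning
    return decoded

# Stand-in for the bytes environment, mirroring Example 1.
raw_env = {b'HOME': b'/home/user', b'PATH': b'\xff\xff\xff\xff'}
environ = decode_environ_dropping(raw_env)

try:
    environ['PATH']
    path_lookup = 'found'
except KeyError:
    path_lookup = 'KeyError'
```

The lookup fails exactly as it would for a variable that was never set, which
is the loss of information described above.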

> Example 2:
>   os.environb['PATH'] = b'têst'
> What is the value in os.environ['PATH']?

This doesn't work in python3, since bytes literals can only contain ASCII
characters.

> Example 3:
>   os.environ['PATH'] = 'têst'
> What is the value in os.environb['PATH']?

Dependent on the default system encoding.  Assuming utf-8 encoding,
os.environb['PATH'] == b't\xc3\xaast'
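
The answers to Examples 2 and 3 can be checked directly. This is a small
sketch that assumes UTF-8 as the encoding; it uses compile() only so that the
invalid literal does not stop the rest of the snippet from running:

```python
# Example 2: a bytes literal containing a non-ASCII character is rejected
# by the compiler, so the assignment can never be written that way.
try:
    compile("b't\u00east'", '<example>', 'eval')   # source text: b'têst'
    literal_ok = True
except SyntaxError:
    literal_ok = False

# Example 3: with UTF-8 as the encoding, the str value 'têst' maps to
# these bytes.
encoded = 't\u00east'.encode('utf-8')
```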

> Example 4:
>  should I use os.environ['PATH'] or os.environb['PATH'] to get the current
>  PATH?

Should you use os.listdir('.') or os.listdir(b'.') to get the list of files in
the current directory?

This is where treating pathnames, environment variables, and so on as strings
instead of bytes stops being simple.  Now you have to decide what you really
want to know (and possibly keep two slightly different values if you want to
know two things).

If you want to keep the path in order to look up commands that the user can
run you want os.environb['PATH'] since this is exactly what the shell will use
when the user types a command at the commandline.
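
A sketch of such a lookup that stays in bytes from start to finish. The
find_command name is hypothetical, raw_path stands in for whatever
os.environb['PATH'] would return, and a Unix-style ':' separator is assumed:

```python
import os

def find_command(name, raw_path):
    """Return the first executable called name in a bytes PATH, or None.

    Both arguments are bytes, so no directory needs to be decodable.
    """
    for directory in raw_path.split(b':'):
        if not directory:
            directory = b'.'  # an empty PATH entry means the current directory
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

Because nothing is decoded, a directory whose name is not valid in the system
encoding is searched like any other.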

If you want to display the elements of the PATH for the user, you probably
want this::

  import os
  import sys

  try:
      path = os.environ['PATH'].split(':')
  except KeyError:
      # Not decodable (or not set): fall back to the raw bytes.
      try:
          raw_path = os.environb['PATH'].split(b':')
      except KeyError:
          path = DEFAULT_PATH
      else:
          # Decode each directory, replacing undecodable bytes.
          path = [str(directory, sys.getdefaultencoding(), 'replace')
                  for directory in raw_path]

> It introduces many new cases (bugs?) that have to be prepared and tested.

Those bugs are *already present*.  Without taking one of the four options,
there's simply no way to code a solution.  Take the above code and imagine
that there's no way to access the user's PATH variable when a
non-default-encoding character is present in the PATH.  That means that you're
always stuck with the value of DEFAULT_PATH instead of being able to display
something reasonable to the user.

(Note, these examples are pretty much the same for option #3 or option #4.
The value of option #3 becomes apparent when you use os.getenv('PATH') instead
of os.environ['PATH'])
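
One way to read that note, as a sketch only: with a getenv-style call, option 4
makes an undecodable value indistinguishable from an unset one, while option 3
reports it. The function names and the exception class here are hypothetical,
raw_env is a plain dict standing in for the bytes environment, and UTF-8 is
assumed as the system encoding:

```python
class UndecodableVariable(Exception):
    """Hypothetical error raised under option 3."""

def getenv_option3(raw_env, name, default=None, encoding='utf-8'):
    """Option 3 sketch: fail loudly when the value cannot be decoded."""
    raw = raw_env.get(name.encode(encoding))
    if raw is None:
        return default
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        raise UndecodableVariable(name) from exc

def getenv_option4(raw_env, name, default=None, encoding='utf-8'):
    """Option 4 sketch: an undecodable value looks like a missing one."""
    raw = raw_env.get(name.encode(encoding))
    if raw is None:
        return default
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return default

raw_env = {b'PATH': b'\xff\xff\xff\xff'}
silent = getenv_option4(raw_env, 'PATH')   # None, same as an unset variable
try:
    getenv_option3(raw_env, 'PATH')
    loud = False
except UndecodableVariable:
    loud = True
```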

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue4006>
_______________________________________

