Folks:

My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary binary names from the filesystem and store them so that I can regenerate the same byte string later, but it also requires that I *know* whether what I got was a valid string in the expected encoding (which might be utf-8) or whether it was not and I need to fall back to storing the bytes.

So far, it looks like PEP 383 doesn't satisfy both of these requirements, so I am going to have to continue working around the Python API even after PEP 383. In fact, it might actually increase the amount of working-around that I have to do.

If I understand correctly, .decode(encoding, 'strict') will not be changed by PEP 383. A new error handler is added, so .decode('utf-8', 'python-escape') performs the utf-8b decoding. Am I right so far?

Therefore, if I have a string of bytes, I can attempt to decode it with 'strict', and if that fails I can set a flag showing that it was not a valid byte string in the expected encoding, and then I can invoke .decode('utf-8', 'python-escape') on it. So far, so good. (Note that I never want to do .decode(expected_encoding, 'python-escape') -- if it wasn't a valid bytestring in the expected_encoding, then I want to decode it with utf-8b, regardless of what the expected encoding was.)

Anyway, I can use it like this:

    import os, platform, sys

    class FName:
        def __init__(self, name, failed_decode=False):
            self.name = name
            self.failed_decode = failed_decode

    def fs_to_unicode(bytes):
        try:
            return FName(bytes.decode(sys.getfilesystemencoding(), 'strict'))
        except UnicodeDecodeError:
            return FName(bytes.decode('utf-8', 'python-escape'),
                         failed_decode=True)

And what about unicode-oriented APIs such as os.listdir()? Uh-oh, the PEP says that on systems with locale 'utf-8', it will automatically be changed to 'utf-8b'. This means I can't reliably find out whether the entries in the directory *were* named with valid encodings in utf-8? That's not acceptable for my use case. I would have to refrain from using the unicode-oriented os.listdir() on POSIX, and instead do something like this:

    if platform.system() in ('Windows', 'Darwin'):
        def listdir(d):
            return [FName(n) for n in os.listdir(d)]
    elif platform.system() in ('Linux', 'SunOS'):
        def listdir(d):
            bytesd = d.encode(sys.getfilesystemencoding())
            return [fs_to_unicode(n) for n in os.listdir(bytesd)]
    else:
        raise NotImplementedError("Please classify platform.system() == %s "
                                  "as either unicode-safe or unicode-unsafe."
                                  % platform.system())

In fact, if 'utf-8' gets automatically converted to 'utf-8b' when *decoding* as well as when encoding, then I would have to change my fs_to_unicode() function to check for that and make sure to use strict utf-8 in the first attempt:

    def fs_to_unicode(bytes):
        fse = sys.getfilesystemencoding()
        if fse == 'utf-8b':
            fse = 'utf-8'
        try:
            return FName(bytes.decode(fse, 'strict'))
        except UnicodeDecodeError:
            return FName(bytes.decode('utf-8', 'python-escape'),
                         failed_decode=True)

Would it be possible for Python unicode objects to have a flag indicating whether the 'python-escape' error handler was invoked when they were created? That would serve the same purpose as my "failed_decode" flag above, and would basically allow me to use the Python APIs directly and make all of this work-around code disappear. Failing that, I can't see any way to use os.listdir() in its unicode-oriented mode to satisfy Tahoe's requirements.

If you take the above code and then add the fact that you want to use the failed_decode flag when *encoding* the d argument to os.listdir(), then you get this code: [2].
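(Roughly, the encoding direction is just the mirror image of fs_to_unicode() above. The following is only an illustrative sketch -- the real code is at [2], and "unicode_to_fs" is just a name I'm using here -- assuming, as the PEP describes, that the 'python-escape' handler also works when encoding:)

    def unicode_to_fs(fname):
        # fname is an FName as produced by fs_to_unicode()/listdir() above
        if fname.failed_decode:
            # the name came from the utf-8b fallback, so re-encoding it with
            # 'python-escape' regenerates the original byte string
            return fname.name.encode('utf-8', 'python-escape')
        else:
            # it really was a valid name in the expected encoding
            return fname.name.encode(sys.getfilesystemencoding(), 'strict')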
Oh, I just realized that I *could* use the PEP 383 os.listdir(), like this:

    def listdir(d):
        fse = sys.getfilesystemencoding()
        if fse == 'utf-8b':
            fse = 'utf-8'
        ns = []
        for fn in os.listdir(d):
            bytes = fn.encode(fse, 'python-escape')
            try:
                ns.append(FName(bytes.decode(fse, 'strict')))
            except UnicodeDecodeError:
                ns.append(FName(bytes.decode('utf-8', 'python-escape'),
                                failed_decode=True))
        return ns

(And I guess I could define listdir() like this only on the non-unicode-safe platforms, as above.) However, that strikes me as even more horrible than the previous "listdir()" work-around, in part because it means decoding, re-encoding, and re-decoding every name, so I think I would stick with the previous version.

Oh, one more note: for Tahoe's purposes you can, in all of the code above, replace ".decode('utf-8', 'python-escape')" with ".decode('windows-1252')" and it works just as well. While UTF-8b seems like a really cool hack, and it would produce more legible results if utf-8-encoded strings were partially corrupted, I guess I should just use 'windows-1252', which is already implemented in Python 2 (as well as in all other software in the world).

I guess this means that PEP 383, which I have approved of and liked so far in this discussion, would actually not help Tahoe at all, and would in fact harm Tahoe -- I would have to remember to detect and work around the automatic 'utf-8b' filesystem encoding when porting Tahoe to Python 3.

If anyone else has a concrete, real use case which would be helped by PEP 383, I would like to hear about it. Perhaps Tahoe can learn something from it.

Oh, and if this PEP could be extended to add a flag to each unicode object indicating whether it was created with the python-escape handler or not, then it would be useful to me.

Regards,

Zooko

[1] http://mail.python.org/pipermail/python-dev/2009-April/089020.html
[2] http://allmydata.org/trac/tahoe/attachment/ticket/534/fsencode.3.py
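P.S. To make the detection step concrete, here is a tiny self-contained sketch (an illustration, not Tahoe code) of what the decode/re-encode/re-decode cycle in the listdir() above is doing, assuming an interpreter in which the PEP's 'python-escape' handler is registered (substitute whatever name it finally ships under):

    good = 'p\u00e4th'.encode('utf-8')   # valid utf-8: b'p\xc3\xa4th'
    bad = b'\xff\xfe-not-utf-8'          # not decodable as strict utf-8

    for raw in (good, bad):
        try:
            name, failed = raw.decode('utf-8', 'strict'), False
        except UnicodeDecodeError:
            name, failed = raw.decode('utf-8', 'python-escape'), True
        # re-encoding with 'python-escape' restores the original bytes,
        # and the failed flag records which names were not valid utf-8
        assert name.encode('utf-8', 'python-escape') == raw
        print(repr(raw), failed)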