[Python-Dev] Bytes path support
chris.barker at noaa.gov
Fri Aug 22 20:53:01 CEST 2014
On Thu, Aug 21, 2014 at 7:42 PM, Oleg Broytman <phd at phdru.name> wrote:
> On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal <
> chris.barker at noaa.gov> wrote:
> > This brings up the other key problem. If file names are (almost)
> > arbitrary bytes, how do you write one to/read one from a text file
> > with a particular encoding? ( or for that matter display it on a
> > terminal)
> There is no such thing as an encoding of text files. So we just
> write those bytes to the file
So I write bytes that are encoded one way into a text file that's encoded
another way, and expect to be abel to read that later? you're kidding,
right? Only if that's he only thing in the file -- usually not the case
with my text files.
or output them to the terminal. I often do
> that. My filesystems are full of files with names and content in
> at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a
> terminal with koi8 or utf-8 locale and fonts and some file always look
> weird. But however weird they are it's possible to work with them.
Not for me (or many other users) -- terminals are sometimes set with
ascii-only encoding, so non-ascii barfs -- or you get some weird control
characters that mess up your terminal -- dumping arbitrary bytes to a
terminal does not always "just work".
> > And people still want to say posix isn't broken in this regard?
> Not at all! And broken or not broken it's what I (for many different
> reasons) prefer to use for my desktops, servers, notebooks, routers and
Sorry -- that's a Red Herring -- I agree, "broken" or "simple and
consistent" is irrelevant, we all want Python to work as well as it can on
The point is that if you are reading a file name from the system, and then
passing it back to the system, then you can treat it as just bytes -- who
cares? And if you add the byte value of 47 thing, then you can even do
basic path manipulations. But once you want to do other things with your
file name, then you need to know the encoding. And it is very, very common
for users to need to do other things with filenames, and they almost always
want them as text that they can read and understand.
Python3 supports this case very well. But it does indeed make it hard to
work with filenames when you don't know the encoding they are in. And
apparently that's pretty common -- or common enough that it would be nice
for Python to support it well. This trick is how -- we'd like the "just
pass it around and do path manipulations" case to work with (almost)
arbitrary bytes, but everything else to work naturally with text (unicode
Which brings us to the "what APIs should accept bytes" question. I think
that's been pretty much answered: All the low-level ones, so that protocol
and library programmers can write code that works on systems with undefined
But: casual users still need to do the normal things with file names and
paths, and ideally those should work the same way on all systems.
I think the way to do this is to abstract the path concept, like pathlib
does. Back in the day, paths were "just strings", and that worked OK with
py2 strings, because you could put arbitrary bytes in them. But the "py2
strings were perfect" folks seem to not acknowledge that while they are
nice for matching the posix filename model, they were a pain in the neck
when you needed to do somethign else like write them in to a JSON file or
something. From my personal experience, non-ascii filenames are much easier
to deal with if I use unicode for filenames everywhere (py2). Somehow, I
have yet to be bitten by mixed encoding in filenames.
So will using a surrogate-escape error handling with pathlib make all this
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Python-Dev