[Python-ideas] Re: prefix/suffix for bytes (was: New explicit methods to trim strings)

11 Mar 2020

Christopher, I'm not sure how much of the following you already know, so 
excuse me in advance if I'm covering old ground for you. But hopefully 
it will be helpful for someone!

On Tue, Mar 10, 2020 at 01:18:22PM -0700, Christopher Barker wrote:
> Getting a bit OT, but I *think* this is the story:
> I've heard it argued, by folks that want to write Python software that uses
> bytes for filenames, that:
> A file path on a *nix system can be any string of bytes, except two special
> values:
> b'\x00'  : null
> b'\x2f'  : slash

To be precise: file *names* cannot contain null bytes or slashes. Paths 
can contain slashes, which represent directory separators.

To be even more precise: technically this depends on the file system, 
not the OS. There are a tiny handful of file systems that support NULs 
and/or slashes in file names, although whether you could actually 
operate on those files in practice is another story.

In practice, the prohibition on NUL and slash is baked so deeply into 
the Unix world at all levels (OS, shells, applications) that even if you 
had a file system that supported them, I doubt you would be able to use 
those characters in file names.
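
As a rough illustration of that rule, here is a minimal sketch of a check 
for a single path component. The helper name is_valid_component is made 
up for this example (it is not in the standard library), and it ignores 
anything else a real file system might enforce, such as length limits.

```python
def is_valid_component(name: bytes) -> bool:
    """Return True if `name` could be a single Unix file name.

    Hypothetical helper: it only rejects the empty name and the two
    forbidden bytes (NUL and slash).  Length limits are ignored.
    """
    return bool(name) and b'\x00' not in name and b'/' not in name


print(is_valid_component(b'spam'))        # True
print(is_valid_component(b'sp/am'))       # False: slash is the separator
print(is_valid_component(b'sp\x00am'))    # False: embedded NUL
print(is_valid_component(b'\xff\xfe'))    # True: not valid UTF-8, still legal
```

Note that the last name is perfectly legal as far as the OS is concerned, 
even though it cannot be decoded as UTF-8.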

> So any encoding will work, as long as those two values mean the right
> thing. Practically, null is always null, so that leaves the slash.
> So any encoding that uses b'\x2f' for the slash would work. Which seems to
> include, for instance, UTF-16:
> In [31]: "/".encode('utf-16')
> Out[31]: b'\xff\xfe/\x00'

You probably don't want the UTF-16 BOM (byte-order-mark) in the name. So 
you want "/".encode('utf-16le') or perhaps 'utf-16be'.

But either way, that won't work, because the path contains a NUL byte.

Let's be concrete. Both of these are fine:

    >>> open('/tmp/spam', 'w')
    <_io.TextIOWrapper name='/tmp/spam' mode='w' encoding='UTF-8'>

    >>> open(b'/tmp/spam', 'w')
    <_io.TextIOWrapper name=b'/tmp/spam' mode='w' encoding='UTF-8'>

Both of those represent the same file, because:

    >>> '/tmp/spam'.encode('utf-8') == b'/tmp/spam'
    True

However, if I use UTF-16, it fails because the file name and path 
contain NUL bytes:

    >>> path = '/tmp/spam'.encode('utf-16be')
    >>> print(path)
    b'\x00/\x00t\x00m\x00p\x00/\x00s\x00p\x00a\x00m'
    >>> f = open(path, 'w')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: embedded null byte

Here are some fun and games with Unicode... 

There's no way to get the file b'/tmp/spam' from a UTF-16 string, since 
that name has an odd number of bytes and UTF-16 always encodes to an even 
number, but I can get the file '/tmp/ spam' (note the leading space!) 
like this:

    name = ('\N{CJK UNIFIED IDEOGRAPH-742F}'
            '\N{CJK UNIFIED IDEOGRAPH-706D}'
            '\N{NARROW NO-BREAK SPACE}'
            '\N{CJK UNIFIED IDEOGRAPH-7073}'
            '\N{CJK UNIFIED IDEOGRAPH-6D61}')
    open(name.encode('utf-16le'), 'w')

That will open the file b'/tmp/ spam'. If that doesn't make you pine for 
the simpler days when the whole computing world used nothing but 
American English and liked it, then nothing will :-)
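
For anyone who wants to check the trick without creating a file, this 
small sketch encodes the same five code points (written here with \u 
escapes rather than \N{...} names) and shows which two bytes each one 
contributes under UTF-16-LE:

```python
# The same five characters as above, spelled with \u escapes.
name = '\u742f\u706d\u202f\u7073\u6d61'

encoded = name.encode('utf-16le')
print(encoded)                      # b'/tmp/ spam'

# Each code point becomes two bytes, low byte first.
for ch in name:
    print(f'U+{ord(ch):04X} -> {ch.encode("utf-16le")!r}')
```

The narrow no-break space U+202F is the one that supplies the second 
slash (0x2f) followed by an ordinary space (0x20).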

[...]
> In practice, I suspect that every *nix system uses encoding(s) that are
> ASCII-compatible for the first 127 values, or more precisely that have the
> slash character value be a single byte of value: 2f. And as long as that's
> the case (and that value doesn't show up anywhere else) then software can
> know nothing about the encoding, and still do two things:
> * pass it around
> * split on the slash.

In practice, modern Unix shells and GUIs use UTF-8. UTF-8 has two nice 
properties:

* Every ASCII character encodes to a single byte, so text which 
  only contains ASCII values encodes to precisely the same set 
  of bytes under UTF-8 as under ASCII.

* No Unicode character, except for the Unicode NUL '\0', encodes
  to a sequence containing a null byte.

These properties are not an accident -- they were carefully designed 
that way.
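
Both properties are easy to check for yourself. The sketch below uses an 
arbitrary sample string of my own choosing and compares UTF-8 with 
UTF-16-LE. It also shows a closely related property: in UTF-8, every 
byte of a multi-byte sequence is 0x80 or above, so the byte 0x2f can 
only ever come from an actual slash.

```python
ascii_text = '/tmp/spam'

# Property 1: pure-ASCII text gives identical bytes under both codecs.
same_bytes = ascii_text.encode('utf-8') == ascii_text.encode('ascii')
print(same_bytes)                       # True

# An arbitrary sample with non-ASCII characters and two slashes.
sample = 'Mετăl/日本語/\N{NARROW NO-BREAK SPACE}'
utf8 = sample.encode('utf-8')
utf16 = sample.encode('utf-16le')

# Property 2: no NUL bytes unless the text itself contains '\0'.
print(b'\x00' in utf8)                  # False
print(b'\x00' in utf16)                 # True

# Related: count the 0x2f bytes against the real slashes in the text.
print(sample.count('/'), utf8.count(b'/'), utf16.count(b'/'))   # 2 2 3
```

The stray third "slash" under UTF-16-LE is the low byte of U+202F.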

In practice, the Unix OS doesn't care what encoding the user interface 
uses. It deals in bytes, and that's all it cares about.

Only the interface cares about the encoding, because if you name a file 
"Mετăl" in the shell, you expect to see the same file name in the open 
file dialog of all your GUI applications. (And vice versa.) So all the 
various shells, GUIs etc have to agree to use the same encoding or 
users get mad.
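
To make the "the OS deals in bytes" point concrete, here is a small 
sketch that handles a path purely as bytes. The path itself is invented 
for the example, and its last component is deliberately not valid UTF-8. 
The surrogateescape error handler shown at the end is, as far as I know, 
what os.fsdecode() and os.fsencode() use on POSIX systems, but here it is 
applied explicitly so the result does not depend on the platform.

```python
import posixpath

# Invented example path; b'\xff\xfe.dat' cannot be decoded as UTF-8.
raw = b'/tmp/M\xce\xb5\xcf\x84\xc4\x83l/\xff\xfe.dat'

# Splitting and taking the base name need no knowledge of the encoding.
parts = raw.split(b'/')
print(parts)
print(posixpath.basename(raw))          # b'\xff\xfe.dat'

# Undecodable bytes survive a round trip through str when the
# surrogateescape error handler is used.
text = raw.decode('utf-8', 'surrogateescape')
round_trip = text.encode('utf-8', 'surrogateescape')
print(round_trip == raw)                # True
```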

> Which may be enough for various system tools. So a fine argument for
> facilitating their use in things like globbing directories, opening files,
> etc.
> But as soon as you have any interaction with humans, then filenames need to
> be human meaningful. And as soon as you manipulate the names beyond
> splitting and merging on a slash, then you do need to know something about
> the encoding.

Well... yes and no.

In practice, Unix users don't care any more about encodings than Windows 
users do.

Possibly less! Windows users have to deal with legacy systems that can 
use any of the hundreds of pre-Unicode "extended ASCII" 8-bit systems. 
If Stephen Turnbull is reading this, he can probably tell you scary 
stories about the chaos in pre-Unicode East Asian encodings like Big5 and 
Shift-JIS.

In the Linux/BSD world, pretty much everyone uses UTF-8 unless you're 
doing something unusual, like trying to extract data from some old 
Russian CSV file or ancient Macintosh text file.

In the shell, I couldn't care less about the encoding. I just name files 
whatever I want, and the shell deals with them:

    [steve@ando ~]$ touch Mετăl
    [steve@ando ~]$ ls -l M*l
    -rw-rw-r-- 1 steve steve    0 Mar 11 20:46 Mετăl

(Alas, typing those non-ASCII characters is a PITA; I ended up having to 
enter them using a GUI "Character Map" application and paste them into 
the shell.)

In Python, I never worry about encodings. I just use regular old 
Unicode strings:

    open('Mετăl')

and it Just Works. I would expect that the majority of Unix users will 
be in the same boat.
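
If you are curious what the OS actually stores, this sketch creates the 
file in a temporary directory and lists that directory twice, once as 
str and once as bytes. It assumes a POSIX system whose file system 
encoding is UTF-8 (the usual case on modern Linux); on other platforms, 
or on file systems that normalise names, the listings may differ.

```python
import os
import tempfile

name = 'Mετăl'

with tempfile.TemporaryDirectory() as tmpdir:
    # Create the file using an ordinary Unicode string.
    with open(os.path.join(tmpdir, name), 'w') as f:
        f.write('spam\n')

    # Passing a str directory gives str names back...
    str_listing = os.listdir(tmpdir)
    # ...and passing a bytes directory gives the raw bytes back.
    bytes_listing = os.listdir(os.fsencode(tmpdir))

print(str_listing)
print(bytes_listing)
```

Under those assumptions the second listing shows the UTF-8 bytes of the 
name, which is all the kernel ever sees.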

I think that the exception will be people writing applications that have 
to straddle the low-level "Unix file names are bytes" and high-level 
"you can use anything you can type as a file name" worlds.

But bytes are useful for more than just file names! Anyone writing a 
binary file format needs to deal with bytes, and I'm confident that 
there are binary formats that have optional prefixes and suffixes that 
might need to be stripped before doing further processing.

    if chunk.startswith(b'DEADBEEF'):
        chunk = chunk[8:]
    process(chunk)

One possible example: you have some binary data which may or may not 
have a NUL byte at the end. The NUL byte is redundant in Python (we're 
not C) so you want to delete it:

    if data.endswith(b'\0'):
        data = data[:-1]
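
Both snippets follow the same pattern, which can be wrapped up in a 
pair of small helpers. The names strip_prefix and strip_suffix are my 
own for this sketch; Python 3.9 and later provide bytes.removeprefix() 
and bytes.removesuffix(), which behave the same way. The last lines show 
why lstrip() and rstrip() are not substitutes: they treat their argument 
as a set of byte values, not as a prefix or suffix.

```python
def strip_prefix(data: bytes, prefix: bytes) -> bytes:
    """Remove `prefix` from the start of `data` at most once."""
    if prefix and data.startswith(prefix):
        return data[len(prefix):]
    return data


def strip_suffix(data: bytes, suffix: bytes) -> bytes:
    """Remove `suffix` from the end of `data` at most once."""
    # The `suffix and` guard matters: data[:-0] would be empty.
    if suffix and data.endswith(suffix):
        return data[:-len(suffix)]
    return data


print(strip_prefix(b'DEADBEEFpayload', b'DEADBEEF'))    # b'payload'
print(strip_suffix(b'payload\x00', b'\x00'))            # b'payload'

# lstrip() removes any leading bytes found in the given set:
print(b'DEADBEEFEDpayload'.lstrip(b'DEADBEEF'))         # b'payload'
print(strip_prefix(b'DEADBEEFEDpayload', b'DEADBEEF'))  # b'EDpayload'

# rstrip() removes every trailing NUL, not just one:
print(b'payload\x00\x00'.rstrip(b'\x00'))               # b'payload'
print(strip_suffix(b'payload\x00\x00', b'\x00'))        # b'payload\x00'
```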

-- 
Steven

Steven D'Aprano