Python 3 encoding question: Read a filename from stdin, subsequently open that filename
usenet at solar-empire.de
Tue Nov 30 11:06:45 CET 2010
Dan Stromberg <drsalists at gmail.com> wrote:
> I've got a couple of programs that read filenames from stdin, and then
> open those files and do things with them. These programs sort of do
> the *ix xargs thing, without requiring xargs.
> In Python 2, these work well. Irrespective of how filenames are
> encoded, things are opened OK, because it's all just a stream of
> single byte characters.
> In Python 3, I'm finding that I have encoding issues with characters
> with their high bit set. Things are fine with strictly ASCII
> filenames. With high-bit-set characters, even if I change stdin's
> encoding with:
> import io
> import sys
> STDIN = io.open(sys.stdin.fileno(), 'r', encoding='ISO-8859-1')
> ...even with that, when I read a filename from stdin with a
> single-character Spanish n~, the program cannot open that filename
> because the n~ is apparently internally converted to two bytes, but
> remains one byte in the filesystem. I decided to try ISO-8859-1 with
> Python 3, because I have a Java program that encountered a similar
> problem until I used en_US.ISO-8859-1 in an environment variable to
> set the JVM's encoding for stdin.
> Python 2 shows the n~ as 0xf1 in an os.listdir('.'). Python 3 with an
> encoding of ISO-8859-1 wants it to be 0xc3 followed by 0xb1.
> Does anyone know what I need to do to read filenames from stdin with
> Python 3.1 and subsequently open them, when some of those filenames
> include characters with their high bit set?
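The mismatch you describe is what I would expect if the filesystem encoding on your machine is UTF-8 (an assumption on my part; sys.getfilesystemencoding() will tell you). Decoding stdin as ISO-8859-1 gives you the right str, but open() then re-encodes that str with the filesystem encoding, not with ISO-8859-1. A small sketch with a made-up filename:

```python
import sys

# Hypothetical filename containing a Spanish n-tilde (U+00F1).
name = "ma\u00f1ana"

# What is actually stored on disk in your case: one byte, 0xf1.
on_disk = name.encode("iso-8859-1")
print(on_disk)        # b'ma\xf1ana'

# What Python 3 sends to the OS if the filesystem encoding is UTF-8:
# two bytes, 0xc3 0xb1, which do not match the name on disk.
as_utf8 = name.encode("utf-8")
print(as_utf8)        # b'ma\xc3\xb1ana'

# The encoding Python uses to turn str paths into bytes on this system.
print(sys.getfilesystemencoding())
```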
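Try using sys.stdin.buffer instead of sys.stdin. It gives you bytes
instead of strings, so the filenames are never decoded or re-encoded.
Also use bytes literals instead of string literals for paths, e.g.
os.listdir(b'.'), which then returns the names as bytes as well.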
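
Roughly like this. It is only a sketch: the helper names iter_filenames and file_sizes are made up for illustration, and I am assuming one filename per line, terminated by a newline.

```python
import sys

def iter_filenames(lines):
    """Yield each non-empty filename as bytes, without the trailing newline."""
    for line in lines:
        name = line.rstrip(b"\n")
        if name:
            yield name

def file_sizes(lines):
    """Open every named file in binary mode and yield (name, size) pairs."""
    for name in iter_filenames(lines):
        # open() accepts a bytes path and passes it to the OS unchanged.
        with open(name, "rb") as f:
            yield name, len(f.read())

# Typical use, reading the raw bytes from stdin:
#
#     for name, size in file_sizes(sys.stdin.buffer):
#         sys.stdout.buffer.write(name + b" " + str(size).encode("ascii") + b"\n")
```

Writing the names back through sys.stdout.buffer keeps the output as bytes too, so nothing has to be decoded for display.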