UTF-8 and stdin/stdout?
Arnaud Delobelle
arnodel at googlemail.com
Wed May 28 05:16:56 EDT 2008
dave_140390 at hotmail.com writes:
> Hi,
>
> I have problems getting my Python code to work with UTF-8 encoding
> when reading from stdin / writing to stdout.
>
> Say I have a file, utf8_input, that contains a single character, é,
> coded as UTF-8:
>
> $ hexdump -C utf8_input
> 00000000 c3 a9
> 00000002
>
> If I read this file by opening it in this Python script:
>
> $ cat utf8_from_file.py
> import codecs
> file = codecs.open('utf8_input', encoding='utf-8')
> data = file.read()
> print "length of data =", len(data)
>
> everything goes well:
>
> $ python utf8_from_file.py
> length of data = 1
>
> The content of utf8_input is one character encoded as two bytes, so
> UTF-8 decoding is working here.
>
> Now, I would like to do the same with standard input. Of course, this:
>
> $ cat utf8_from_stdin.py
> import sys
> data = sys.stdin.read()
> print "length of data =", len(data)
Shouldn't you do data = data.decode('utf8')?  sys.stdin gives you the
raw bytes; nothing decodes them for you.
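Here is a minimal sketch of what that decode does, using a bytes
literal in place of the actual stdin read (the variable names are
mine):

```python
# -*- coding: utf-8 -*-
# The two bytes read from stdin are the UTF-8 encoding of 'é'.
# On Python 2, sys.stdin.read() returns exactly such a byte string;
# decoding it yields a single character, matching the codecs.open
# result.
raw = b'\xc3\xa9'           # stand-in for sys.stdin.read(): 2 bytes
text = raw.decode('utf-8')  # decode the bytes into text explicitly

assert len(raw) == 2        # two bytes on the wire
assert len(text) == 1       # one decoded character
```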
> does not work:
>
> $ python utf8_from_stdin.py < utf8_input
> length of data = 2
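If you'd rather not sprinkle decode() calls around, another option is
to wrap the byte stream in a decoding reader so that read() returns
decoded text directly.  A sketch (on Python 2 you would wrap
sys.stdin itself; here a BytesIO object stands in for the raw stdin
bytes):

```python
import codecs
import io

# Stand-in for the raw byte stream coming from stdin.
byte_stream = io.BytesIO(b'\xc3\xa9')

# codecs.getreader('utf-8') returns a StreamReader class; calling it
# on a byte stream gives a file-like object whose read() decodes.
reader = codecs.getreader('utf-8')(byte_stream)
data = reader.read()

assert len(data) == 1  # one character, as with codecs.open
```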
--
Arnaud