Can a UTF-8 encoded character contain a TAB byte?
Peng Yu
pengyu.ut at gmail.com
Mon Jan 15 09:11:02 EST 2018
Hi,
I use the following code to process TSV input.
$ printf '%s\t%s\n' {1..10} | ./main.py
['1', '2']
['3', '4']
['5', '6']
['7', '8']
['9', '10']
$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
import sys
for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    print fields
But I am not sure it will process UTF-8 input correctly, so I came
up with the code below. However, I am not sure whether the decoding
step is really necessary, as my impression is that the bytes of a
multi-byte UTF-8 character never include the ASCII code for TAB
(0x09). Is that so? Thanks.
$ cat main1.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
import sys
for line in sys.stdin:
    #fields=line.rstrip('\n').split('\t')
    fields = line.rstrip('\n').decode('utf-8').split('\t')
    print [x.encode('utf-8') for x in fields]
$ printf '%s\t%s\n' {1..10} | ./main1.py
['1', '2']
['3', '4']
['5', '6']
['7', '8']
['9', '10']
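As a sanity check on that impression, here is a minimal sketch. It is
written for Python 3, unlike the scripts above, and the function names
are my own and purely illustrative. It encodes every Unicode code point
and looks for the TAB byte. As far as I understand, UTF-8 gives every
byte of a multi-byte sequence a value of 0x80 or above, so only TAB
itself (U+0009) should turn up, which would mean that splitting the raw
bytes on TAB is safe.

```python
# Brute-force check (Python 3): which code points produce a TAB byte
# (0x09) when encoded as UTF-8? Function names are illustrative only.

def code_points_with_tab_byte():
    """Return every code point whose UTF-8 encoding contains byte 0x09."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if b'\t' in chr(cp).encode('utf-8'):
            hits.append(cp)
    return hits


def split_tsv_bytes(line):
    """Split a raw UTF-8 encoded TSV line on TAB without decoding it."""
    return line.rstrip(b'\n').split(b'\t')


if __name__ == '__main__':
    # Only U+0009 (TAB itself) is expected in this list.
    print(code_points_with_tab_byte())
    raw = u'na\u00efve\t\u65e5\u672c\u8a9e\n'.encode('utf-8')
    print([field.decode('utf-8') for field in split_tsv_bytes(raw)])
```

If the first list contains nothing but 9, the byte-level split in
main.py and the decode-then-split in main1.py should yield the same
fields for valid UTF-8 input.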
--
Regards,
Peng