Can utf-8 encoded character contain a byte of TAB?
Peter Otten
__peter__ at web.de
Mon Jan 15 09:35:22 EST 2018
Peng Yu wrote:
> Can utf-8 encoded character contain a byte of TAB?
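Yes; ASCII is a subset of UTF-8, so the TAB character itself is encoded as
the single byte 0x09. That byte never occurs inside a multi-byte sequence,
though, because every byte of such a sequence has its high bit set.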
Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> ascii = "".join(map(chr, range(128)))
>>> uni = ascii.decode("utf-8")
>>> len(uni)
128
>>> assert map(ord, uni) == range(128)
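To see the other direction, here is a minimal sketch that checks every
code point by brute force. It assumes Python 3 rather than the 2.7 session
above, and the helper name chars_containing_tab is made up for
illustration:

```python
# Sketch (Python 3): find every code point whose UTF-8 encoding
# contains the byte 0x09 anywhere in it.
TAB = 0x09


def chars_containing_tab():
    found = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates cannot be encoded as UTF-8
        if TAB in chr(cp).encode("utf-8"):
            found.append(cp)
    return found


tab_chars = chars_containing_tab()
print(tab_chars)  # [9]: only U+0009 itself

# Consequently, splitting the raw bytes on TAB gives the same fields
# as decoding first and splitting the text.
line = "caf\u00e9\tna\u00efve"
split_bytes = [f.decode("utf-8") for f in line.encode("utf-8").split(b"\t")]
split_text = line.split("\t")
```

So as far as finding the separator goes, it should not matter whether you
split before or after decoding.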
If you want to allow fields containing TABs in a file where TAB is also the
field separator you need a convention to escape the TABs occurring in the
values. Nothing I see in your post can cope with that, but the csv module
can, by quoting fields containing the delimiter:
>>> import csv, sys
>>> csv.writer(sys.stdout, delimiter="\t").writerow(["foo", "bar\tbaz"])
foo "bar baz"
>>> next(csv.reader(['foo\t"bar\tbaz"\n'], delimiter="\t"))
['foo', 'bar\tbaz']
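The same round trip as a small script, as a sketch only: it assumes
Python 3 and an in-memory buffer instead of stdout, and the sample rows
are invented for illustration.

```python
import csv
import io

# Sample rows; the second field of each row contains an embedded TAB.
rows = [["foo", "bar\tbaz"], ["caf\u00e9", "na\u00efve\ttext"]]

# newline="" leaves line endings to the csv module.
buf = io.StringIO(newline="")
csv.writer(buf, delimiter="\t").writerows(rows)

# Reading back with csv.reader honours the quotes added by the writer.
buf.seek(0)
round_tripped = list(csv.reader(buf, delimiter="\t"))

# A plain split("\t") cannot tell a quoted TAB from a separator.
naive = [line.split("\t") for line in buf.getvalue().splitlines()]

print(round_tripped == rows)
print(naive[0])
```

With the default QUOTE_MINIMAL setting the writer only quotes fields that
need it, so files without embedded TABs look the same as plain TSV.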
> Hi,
>
> I use the following code to process TSV input.
>
> $ printf '%s\t%s\n' {1..10} | ./main.py
> ['1', '2']
> ['3', '4']
> ['5', '6']
> ['7', '8']
> ['9', '10']
> $ cat main.py
> #!/usr/bin/env python
> # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1
> # fileencoding=utf-8:
>
> import sys
> for line in sys.stdin:
>     fields=line.rstrip('\n').split('\t')
>     print fields
>
> But I am not sure it will process utf-8 input correctly. Thus, I come
> up with this code. However, I am not sure if this is really necessary
> as my impression is that utf-8 character should not contain the ascii
> code for TAB. Is it so? Thanks.
>
> $ cat main1.py
> #!/usr/bin/env python
> # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1
> # fileencoding=utf-8:
>
> import sys
> for line in sys.stdin:
>     #fields=line.rstrip('\n').split('\t')
>     fields=line.rstrip('\n').decode('utf-8').split('\t')
>     print [x.encode('utf-8') for x in fields]
>
> $ printf '%s\t%s\n' {1..10} | ./main1.py
> ['1', '2']
> ['3', '4']
> ['5', '6']
> ['7', '8']
> ['9', '10']
>
>