Can utf-8 encoded character contain a byte of TAB?
Peng Yu
pengyu.ut at gmail.com
Mon Jan 15 16:29:36 EST 2018
> Just to be clear, TAB *only* appears in utf-8 as the encoding for the actual TAB character, not as a part of any other character's encoding. The only bytes that can appear in the utf-8 encoding of non-ascii characters are starting with 0xC2 through 0xF4, followed by one or more of 0x80 through 0xBF.
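The quoted claim can be checked by brute force. The following is only a sketch, written for Python 3 rather than the Python 2 used elsewhere in this message; the function name encodings_containing_tab is made up for illustration. It encodes every non-ASCII code point and looks for the byte 0x09.

```python
# Python 3 sketch: scan every non-ASCII code point and collect those whose
# UTF-8 encoding contains the TAB byte (0x09).
TAB = 0x09

def encodings_containing_tab():
    """Return the code points whose UTF-8 encoding includes byte 0x09."""
    hits = []
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if TAB in chr(cp).encode('utf-8'):
            hits.append(cp)
    return hits

print(encodings_containing_tab())
```

An empty list means no multi-byte sequence contains a TAB byte, which is what the byte ranges in the quote imply.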
So for UTF-8 encoded input, is this code enough to split each
line into fields?
import sys
for line in sys.stdin:
    # Python 2: line is a byte string, so this splits on the raw TAB byte
    fields = line.rstrip('\n').split('\t')
    print fields
Or do I need to decode each line first, as in this code?
import sys
for line in sys.stdin:
    # Python 2: decode to unicode, split on TAB, then re-encode each field
    fields = line.rstrip('\n').decode('utf-8').split('\t')
    print [x.encode('utf-8') for x in fields]
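To compare the two approaches directly, here is a minimal sketch in Python 3, where bytes and text are distinct types. The helper names split_bytes and split_decoded and the sample line are hypothetical, chosen only for this comparison.

```python
# Python 3 sketch: split raw UTF-8 bytes on TAB versus decoding first.
def split_bytes(line):
    """Split an undecoded UTF-8 line on the TAB byte."""
    return line.rstrip(b'\n').split(b'\t')

def split_decoded(line):
    """Decode, split on the TAB character, then re-encode each field."""
    text = line.rstrip(b'\n').decode('utf-8')
    return [field.encode('utf-8') for field in text.split('\t')]

# Sample line mixing 2-byte and 3-byte characters with TAB separators.
sample = 'naïve\t日本語\tx\n'.encode('utf-8')
assert split_bytes(sample) == split_decoded(sample)
```

For valid UTF-8 input the two functions should return the same fields. One difference is that the decoding version raises UnicodeDecodeError on invalid input, while the byte-level version splits it without complaint.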
--
Regards,
Peng