[Python-ideas] TextIOWrapper callable encoding parameter
Rurpy
rurpy at yahoo.com
Mon Jun 11 17:06:18 CEST 2012
As a follow-up, here are some timing data that seem to confirm
a modest increase in speed from implementing the callable
encoding parameter I proposed (although that would not be the
main reason for wanting it).  These are just for illustration.
(Among many other reasons, _pyio benchmarks are not very useful.)
I read four short test files using four methods for determining
each file's encoding.  The test files are a simplified model
of a Python coding declaration (always on the first line in this
case, with no BOM present [*1]) followed by mixed English and
Japanese text.
Method 0 (reopen0):
Use the encoding callable I am proposing.
def reopen0 (fname):
    def hook (data, buf):
        return get_encoding (data)
    t = io.open (fname, encoding=hook)
Method 1 (reopen1):
Open in binary to determine encoding, then rewrap in a
TextIOWrapper with the correct encoding.
def reopen1 (fname):
    b = io.open (fname, 'rb')
    line = b.readline()
    enc = get_encoding (line)
    b.seek (0)
    t = io.TextIOWrapper (b, enc, line_buffering=True)
    t.mode = 'r'
Method 2 (reopen2):
Open in binary to determine encoding, then reopen in text mode
with correct encoding.
def reopen2 (fname):
    b = io.open (fname, 'rb')
    line = b.readline()
    enc = get_encoding (line)
    t = io.open (fname, encoding=enc)
Method 3 (reopen3):
Open in text mode (latin1) to determine encoding, then reopen
in text mode with correct encoding.
def reopen3 (fname):
    f = io.open (fname, encoding='latin1')
    line = f.readline()
    enc = get_encoding (line)
    t = io.open (fname, encoding=enc)
The same get_encoding() function is used in all methods [*1].
The input test data are all small files (because we want
to measure encoding detection, not how fast read() runs).
Each has a Python/Emacs coding declaration in the first line.
test.utf8 -- Tiny Python program with a coding declaration
and a single print statement in its main() function that
prints a short word (literal) in Japanese.  Encoding is utf-8
(122 bytes).
test.sjis -- Identical to test.utf8 but in sjis encoding
(111 bytes).
test2.utf8 -- A Python coding declaration followed by
approximately 50 long lines of mixed English and
Japanese (4274 bytes).
test2.sjis -- Identical to test2.utf8 but in sjis encoding
(3401 bytes).
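The bm.py script itself was not included here; a minimal sketch of
a harness along these lines (the timeit loop and command-line
handling are assumptions, and it relies on the reopen0..reopen3
functions defined above; reopen0 of course only works with the
proposed callable-encoding support applied) might be:

import sys, timeit

def main ():
    # Time each reopen variant on the file named on the command line.
    fname = sys.argv[1]
    for func in (reopen0, reopen1, reopen2, reopen3):
        secs = timeit.timeit (lambda: func (fname), number=10000)
        print ('%s / %s: total time (10000 reps) was %f'
               % (fname, func.__name__, secs))

if __name__ == '__main__':
    main ()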
Results:
---------------------------------------------------------
$ python3 bm.py test.utf8
test.utf8 / reopen0: total time (10000 reps) was 1.188323
test.utf8 / reopen1: total time (10000 reps) was 1.490757
test.utf8 / reopen2: total time (10000 reps) was 1.766081
test.utf8 / reopen3: total time (10000 reps) was 2.141996
$ python3 bm.py test.sjis
test.sjis / reopen0: total time (10000 reps) was 1.175914
test.sjis / reopen1: total time (10000 reps) was 1.471780
test.sjis / reopen2: total time (10000 reps) was 1.764444
test.sjis / reopen3: total time (10000 reps) was 2.122550
$ python3 bm.py test2.utf8
test2.utf8 / reopen0: total time (10000 reps) was 1.690255
test2.utf8 / reopen1: total time (10000 reps) was 1.996235
test2.utf8 / reopen2: total time (10000 reps) was 2.278798
test2.utf8 / reopen3: total time (10000 reps) was 2.727867
$ python3 bm.py test2.sjis
test2.sjis / reopen0: total time (10000 reps) was 1.841388
test2.sjis / reopen1: total time (10000 reps) was 2.147142
test2.sjis / reopen2: total time (10000 reps) was 2.426701
test2.sjis / reopen3: total time (10000 reps) was 2.873278
----------------------------------------------------------
Here is what happens when a test data file is piped
into a program that uses the four methods above:
$ cat test.utf8 | python3 stdin.py reopen0
read 102 characters
$ cat test.utf8 | python3 stdin.py reopen1
got exception: [Errno 29] Illegal seek
$ cat test.utf8 | python3 stdin.py reopen2
read 0 characters
$ cat test.utf8 | python3 stdin.py reopen3
read 0 characters
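The stdin.py script was not included either; the point of the pipe
test is that a pipe is not seekable.  A sketch of method 1 adapted
to standard input (the details are assumptions, not the original
script) shows where the "Illegal seek" comes from:

import sys, io

def reopen1_stdin ():
    # Assumed adaptation of method 1 to stdin: wrap the already-open
    # binary buffer instead of opening a named file.  A pipe cannot
    # seek, so the seek(0) below raises "[Errno 29] Illegal seek".
    b = sys.stdin.buffer
    line = b.readline()
    enc = get_encoding (line)
    b.seek (0)
    return io.TextIOWrapper (b, enc, line_buffering=True)

try:
    t = reopen1_stdin()
    print ('read %d characters' % len (t.read()))
except OSError as e:
    print ('got exception:', e)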
----
[*1] Here is the get_encoding function used above.  It is
a toy, simplified Python source-encoding reader: toy in
that it looks at only one line, doesn't consider a BOM,
etc.  Its purpose was to let me sanity-check the benefits
of having a callable encoding parameter.
def get_encoding (line):
    if isinstance (line, bytes):
        nlpos = line.index(b'\n')
        mo = ENC_PATTERN_B.search (line, 0, nlpos)
        if not mo: return None
        enc = mo.group(1).decode ('latin1')
    else:
        nlpos = line.index('\n')
        mo = ENC_PATTERN_S.search (line, 0, nlpos)
        if not mo: return None
        enc = mo.group(1)
    return enc
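ENC_PATTERN_B and ENC_PATTERN_S are not shown above; definitions
along the following lines, modeled on the PEP 263 coding-declaration
regex, would make the snippet self-contained (these are assumptions,
not the originals):

import re

# Assumed regexes (the originals were not posted): bytes and str
# versions of a PEP-263-style "coding: <name>" declaration matcher.
ENC_PATTERN_S = re.compile (r'coding[:=]\s*([-\w.]+)')
ENC_PATTERN_B = re.compile (br'coding[:=]\s*([-\w.]+)')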