Re: [Python-ideas] TextIOWrapper callable encoding parameter

As a followup, here are some timing data that seem to confirm a modest increase in speed as a result of implementing the callable encoding parameter I proposed (although that would not be the main reason for wanting to do it). These are just for illustration. (Among many other reasons, _pyio benchmarks are not very useful.)

I read four short test files using four methods for determining the test file's encoding. The test files are a simplified model of a Python coding declaration (always on the first line in our case, with no BOM present [*1]) followed by mixed English and Japanese text.

Method 0 (reopen0): Use the encoding callable I am proposing.

    def reopen0 (fname):
        def hook (data, buf): return get_encoding (data)
        t = io.open (fname, encoding=hook)

Method 1 (reopen1): Open in binary to determine encoding, then rewrap in a TextIOWrapper with the correct encoding.

    def reopen1 (fname):
        b = io.open (fname, 'rb')
        line = b.readline()
        enc = get_encoding (line)
        b.seek (0)
        t = io.TextIOWrapper (b, enc, line_buffering=True)
        t.mode = 'r'

Method 2 (reopen2): Open in binary to determine encoding, then reopen in text mode with correct encoding.

    def reopen2 (fname):
        b = io.open (fname, 'rb')
        line = b.readline()
        enc = get_encoding (line)
        t = io.open (fname, encoding=enc)

Method 3 (reopen3): Open in text mode (latin1) to determine encoding, then reopen in text mode with correct encoding.

    def reopen3 (fname):
        f = io.open (fname, encoding='latin1')
        line = f.readline()
        enc = get_encoding (line)
        t = io.open (fname, encoding=enc)

The same get_encoding() function is used in all methods [*1].

The input test data are all small files (because we want to measure encoding detection, not how fast read() runs). Each has a Python/Emacs coding declaration in the first line.

test.utf8 -- Tiny Python program with a coding declaration and a single print statement in a main() function that prints a short word (literal) in Japanese. Encoding is utf-8 (122 bytes).
test.sjis -- Identical to test.utf8 but sjis encoding (111 bytes).
test2.utf8 -- A Python coding declaration followed by approximately 50 long lines with mixed English and Japanese (4274 bytes).
test2.sjis -- Identical to test2.utf8 but sjis encoding (3401 bytes).

Results:

---------------------------------------------------------
$ python3 bm.py test.utf8
test.utf8 / reopen0: total time (10000 reps) was 1.188323
test.utf8 / reopen1: total time (10000 reps) was 1.490757
test.utf8 / reopen2: total time (10000 reps) was 1.766081
test.utf8 / reopen3: total time (10000 reps) was 2.141996

$ python3 bm.py test.sjis
test.sjis / reopen0: total time (10000 reps) was 1.175914
test.sjis / reopen1: total time (10000 reps) was 1.471780
test.sjis / reopen2: total time (10000 reps) was 1.764444
test.sjis / reopen3: total time (10000 reps) was 2.122550

$ python3 bm.py test2.utf8
test2.utf8 / reopen0: total time (10000 reps) was 1.690255
test2.utf8 / reopen1: total time (10000 reps) was 1.996235
test2.utf8 / reopen2: total time (10000 reps) was 2.278798
test2.utf8 / reopen3: total time (10000 reps) was 2.727867

$ python3 bm.py test2.sjis
test2.sjis / reopen0: total time (10000 reps) was 1.841388
test2.sjis / reopen1: total time (10000 reps) was 2.147142
test2.sjis / reopen2: total time (10000 reps) was 2.426701
test2.sjis / reopen3: total time (10000 reps) was 2.873278
---------------------------------------------------------

Here is what happens when a test data file is piped into a program using the four methods above:

$ cat test.utf8 | python3 stdin.py reopen0
read 102 characters
$ cat test.utf8 | python3 stdin.py reopen1
got exception: [Errno 29] Illegal seek
$ cat test.utf8 | python3 stdin.py reopen2
read 0 characters
$ cat test.utf8 | python3 stdin.py reopen3
read 0 characters

----
[*1] Here is the get_encoding function used above. It is a toy, simplified Python source encoding line reader. Toy, in that it looks at only one line, doesn't consider a BOM, etc.
Its purpose was to allow me to sanity check the benefits of having a callable encoding parameter.

    def get_encoding (line):
        if isinstance (line, bytes):
            nlpos = line.index(b'\n')
            mo = ENC_PATTERN_B.search (line, 0, nlpos)
            if not mo: return None
            enc = mo.group(1).decode ('latin1')
        else:
            nlpos = line.index('\n')
            mo = ENC_PATTERN_S.search (line, 0, nlpos)
            if not mo: return None
            enc = mo.group(1)
        return enc
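
The two compiled patterns that get_encoding relies on, ENC_PATTERN_B and ENC_PATTERN_S, are not shown in the post. The following is a minimal sketch for anyone who wants to run the function, assuming the patterns follow the coding-declaration regex given in PEP 263. The pattern definitions and the sample input lines are assumptions made for illustration, not the author's actual code; get_encoding itself is copied from the post with only the whitespace normalized.

```python
import re

# ASSUMPTION: the post does not define these patterns. They are modelled on
# the regex in PEP 263, once for str input and once for bytes input.
ENC_PATTERN_S = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')
ENC_PATTERN_B = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def get_encoding(line):
    """Toy reader from the post: inspects only the first line, ignores BOMs."""
    if isinstance(line, bytes):
        nlpos = line.index(b'\n')           # raises ValueError if no newline
        mo = ENC_PATTERN_B.search(line, 0, nlpos)
        if not mo:
            return None
        enc = mo.group(1).decode('latin1')  # encoding names are ASCII
    else:
        nlpos = line.index('\n')
        mo = ENC_PATTERN_S.search(line, 0, nlpos)
        if not mo:
            return None
        enc = mo.group(1)
    return enc


# Hypothetical sample inputs, covering the bytes, str and no-declaration cases.
print(get_encoding(b'# -*- coding: sjis -*-\nprint("x")\n'))  # sjis
print(get_encoding('# -*- coding: utf-8 -*-\n'))              # utf-8
print(get_encoding(b'import sys\n'))                          # None
```

Handling both bytes and str is what lets the same function serve all four methods: reopen0, reopen1 and reopen2 pass it bytes, while reopen3 passes it text decoded as latin1.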
participants (1)
-
Rurpy