[Python-ideas] TextIOWrapper callable encoding parameter
Rurpy
rurpy at yahoo.com
Mon Jun 11 17:06:18 CEST 2012
As a follow-up, here are some timing data that seem to confirm
a modest increase in speed from implementing the callable
encoding parameter I proposed (although that would not be the
main reason for wanting it).  These are just for illustration.
(Among many other reasons, _pyio benchmarks are not very useful.)
I read four short test files using four methods for determining
each file's encoding.  The test files are a simplified model
of a Python coding declaration (always on the first line in this
case, with no BOM present [*1]) followed by mixed English and
Japanese text.
Method 0 (reopen0):
Use the encoding callable I am proposing.
def reopen0 (fname):
    def hook (data, buf):
        return get_encoding (data)
    t = io.open (fname, encoding=hook)
Method 1 (reopen1):
Open in binary to determine encoding, then rewrap in a
TextIOWrapper with the correct encoding.
def reopen1 (fname):
    b = io.open (fname, 'rb')
    line = b.readline()
    enc = get_encoding (line)
    b.seek (0)
    t = io.TextIOWrapper (b, enc, line_buffering=True)
    t.mode = 'r'
Method 2 (reopen2):
Open in binary to determine encoding, then reopen in text mode
with correct encoding.
def reopen2 (fname):
    b = io.open (fname, 'rb')
    line = b.readline()
    enc = get_encoding (line)
    t = io.open (fname, encoding=enc)
Method 3 (reopen3):
Open in text mode (latin1) to determine encoding, then reopen
in text mode with correct encoding.
def reopen3 (fname):
    f = io.open (fname, encoding='latin1')
    line = f.readline()
    enc = get_encoding (line)
    t = io.open (fname, encoding=enc)
The same get_encoding() function is used in all methods [*1].
The input test data are all small files (because we want
to measure encoding detection, not how fast read() runs).
Each has a Python/Emacs coding declaration in the first line.
test.utf8 -- Tiny Python program with a coding declaration
and a single print statement in its main() function that
prints a short word (literal) in Japanese.  Encoding is utf-8
(122 bytes).
test.sjis -- Identical to test.utf8 but in sjis encoding
(111 bytes).
test2.utf8 -- A Python coding declaration followed by
approximately 50 long lines of mixed English and
Japanese (4274 bytes).
test2.sjis -- Identical to test2.utf8 but in sjis encoding
(3401 bytes).
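The bm.py script itself was not included here; a minimal sketch of
a harness along these lines (the timeit loop and command-line
handling are assumptions, and it relies on the reopen0..reopen3
functions defined above; reopen0 of course only works with the
proposed callable-encoding support applied) might be:

import sys, timeit

def main ():
    # Time each reopen variant on the file named on the command line.
    fname = sys.argv[1]
    for func in (reopen0, reopen1, reopen2, reopen3):
        secs = timeit.timeit (lambda: func (fname), number=10000)
        print ('%s / %s: total time (10000 reps) was %f'
               % (fname, func.__name__, secs))

if __name__ == '__main__':
    main ()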
Results:
---------------------------------------------------------
$ python3 bm.py test.utf8
test.utf8 / reopen0: total time (10000 reps) was 1.188323
test.utf8 / reopen1: total time (10000 reps) was 1.490757
test.utf8 / reopen2: total time (10000 reps) was 1.766081
test.utf8 / reopen3: total time (10000 reps) was 2.141996
$ python3 bm.py test.sjis
test.sjis / reopen0: total time (10000 reps) was 1.175914
test.sjis / reopen1: total time (10000 reps) was 1.471780
test.sjis / reopen2: total time (10000 reps) was 1.764444
test.sjis / reopen3: total time (10000 reps) was 2.122550
$ python3 bm.py test2.utf8
test2.utf8 / reopen0: total time (10000 reps) was 1.690255
test2.utf8 / reopen1: total time (10000 reps) was 1.996235
test2.utf8 / reopen2: total time (10000 reps) was 2.278798
test2.utf8 / reopen3: total time (10000 reps) was 2.727867
$ python3 bm.py test2.sjis
test2.sjis / reopen0: total time (10000 reps) was 1.841388
test2.sjis / reopen1: total time (10000 reps) was 2.147142
test2.sjis / reopen2: total time (10000 reps) was 2.426701
test2.sjis / reopen3: total time (10000 reps) was 2.873278
----------------------------------------------------------
Here is what happens when a test data file is piped
into a program that uses the four methods above:
$ cat test.utf8 | python3 stdin.py reopen0
read 102 characters
$ cat test.utf8 | python3 stdin.py reopen1
got exception: [Errno 29] Illegal seek
$ cat test.utf8 | python3 stdin.py reopen2
read 0 characters
$ cat test.utf8 | python3 stdin.py reopen3
read 0 characters
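The stdin.py script was not included either; the point of the pipe
test is that a pipe is not seekable.  A sketch of method 1 adapted
to standard input (the details are assumptions, not the original
script) shows where the "Illegal seek" comes from:

import sys, io

def reopen1_stdin ():
    # Assumed adaptation of method 1 to stdin: wrap the already-open
    # binary buffer instead of opening a named file.  A pipe cannot
    # seek, so the seek(0) below raises "[Errno 29] Illegal seek".
    b = sys.stdin.buffer
    line = b.readline()
    enc = get_encoding (line)
    b.seek (0)
    return io.TextIOWrapper (b, enc, line_buffering=True)

try:
    t = reopen1_stdin()
    print ('read %d characters' % len (t.read()))
except OSError as e:
    print ('got exception:', e)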
----
[*1] Here is the get_encoding function used above.  It is
a toy, simplified Python source-encoding reader: toy in
that it looks at only one line, doesn't consider a BOM,
etc.  Its purpose was to let me sanity-check the benefits
of having a callable encoding parameter.
def get_encoding (line):
    if isinstance (line, bytes):
        nlpos = line.index(b'\n')
        mo = ENC_PATTERN_B.search (line, 0, nlpos)
        if not mo: return None
        enc = mo.group(1).decode ('latin1')
    else:
        nlpos = line.index('\n')
        mo = ENC_PATTERN_S.search (line, 0, nlpos)
        if not mo: return None
        enc = mo.group(1)
    return enc
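ENC_PATTERN_B and ENC_PATTERN_S are not shown above; definitions
along the following lines, modeled on the PEP 263 coding-declaration
regex, would make the snippet self-contained (these are assumptions,
not the originals):

import re

# Assumed regexes (the originals were not posted): bytes and str
# versions of a PEP-263-style "coding: <name>" declaration matcher.
ENC_PATTERN_S = re.compile (r'coding[:=]\s*([-\w.]+)')
ENC_PATTERN_B = re.compile (br'coding[:=]\s*([-\w.]+)')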