Need to know if a file has only ASCII characters

Wolfgang Rohdewald wolfgang at rohdewald.de
Wed Jun 17 11:40:31 CEST 2009


On Wednesday 17 June 2009, Lie Ryan wrote:
> Wolfgang Rohdewald wrote:
> > On Wednesday, 17. June 2009, Steven D'Aprano wrote:
> >>         while text:
> >>             for c in text:
> >>                 if c not in printable: return False
> > 
> > that is one loop per character.
> 
> unless printable is a set

That would still execute the line "if c not in..."
once for every single character, versus just one
regex call per block. With bigger block sizes, the
advantage of the regex should increase.
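
To make the comparison concrete, here is a minimal sketch for Python 3 that checks a single block both ways. The helper names and the use of bytes are assumptions for illustration, not the code that was profiled in this thread. The loop runs one membership test per byte in the interpreter, while the regex scans the whole block inside a single call.

```python
import re
from string import printable

# Assumption: blocks are bytes, as read from a file opened in 'rb' mode on Python 3.
PRINTABLE = printable.encode('ascii')
PRINTABLE_SET = frozenset(PRINTABLE)          # set of int byte values
NON_PRINTABLE = re.compile(b'[^' + re.escape(PRINTABLE) + b']')

def block_ok_loop(block):
    # One set membership test per byte, each executed by the interpreter.
    for b in block:
        if b not in PRINTABLE_SET:
            return False
    return True

def block_ok_regex(block):
    # A single search call scans the whole block.
    return NON_PRINTABLE.search(block) is None
```

Both helpers should give the same answer for any block; only the number of interpreter-level steps differs.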
 
> > wouldn't it be faster to apply a regex to text?
> > something like
> > 
> > while text:
> > 	if re.search(r'\W',text): return False
> > 
> 
> regex? Don't even start...

Here is a cProfile test. Note that Steven's first variant
would always have stopped after the first character. After fixing
that, it looks like variant 2 with a block size of 1, so I now have
three variants:

Variant 1 Blocksize 1
Variant 2 Blocksize 65536
Variant 3 Regex on Blocksize 65536

Testing with a 400k-byte file shows the regex as the clear winner.
For an 8k file, variant 2 takes 3 ms and the regex takes 5 ms.

Variants 2 and 3 take about the same time for a 20k file.

The output below is from the 400k file, one line per variant in order.
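
For anyone who wants to look for the crossover point themselves, here is a rough sketch using timeit on in-memory data instead of cProfile, which adds overhead to every profiled function call. It targets Python 3, and all names (check_loop, check_regex, compare) are made up for this example. The data is a run of the letter 'a', so both checks always scan the full block. Numbers will differ by machine and Python version.

```python
import re
import timeit
from string import printable

PRINTABLE = printable.encode('ascii')
PRINTABLE_SET = frozenset(PRINTABLE)
# Compiled once at import time, so compile cost is not part of the timings.
NON_PRINTABLE = re.compile(b'[^' + re.escape(PRINTABLE) + b']')

def check_loop(data):
    # Per-byte membership test in a Python loop (like variant 2).
    for b in data:
        if b not in PRINTABLE_SET:
            return False
    return True

def check_regex(data):
    # One regex search over the whole block (like variant 3).
    return NON_PRINTABLE.search(data) is None

def compare(sizes=(8 * 1024, 20 * 1024, 400 * 1024), repeat=3):
    # Returns {size: {'loop': seconds, 'regex': seconds}}, best of `repeat` runs.
    results = {}
    for size in sizes:
        data = b'a' * size
        results[size] = {
            'loop': min(timeit.repeat(lambda: check_loop(data),
                                      number=1, repeat=repeat)),
            'regex': min(timeit.repeat(lambda: check_regex(data),
                                       number=1, repeat=repeat)),
        }
    return results
```

Calling compare() and printing the returned dict gives one pair of timings per size.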


python ascii.py | grep CPU
         398202 function calls in 1.597 CPU seconds
         13 function calls in 0.104 CPU seconds
         1181 function calls in 0.012 CPU seconds

import re
import cProfile

from string import printable

# Variant 1: read and test one byte at a time
def ascii_file1(name):
    with open(name, 'rb') as f:
        c = f.read(1)
        while c:
            if c not in printable: return False
            c = f.read(1)
        return True

# Variant 2: read 64k blocks, test each character in a Python loop
def ascii_file2(name):
    bs = 65536
    with open(name, 'rb') as f:
        text = f.read(bs)
        while text:
            for c in text:
                if c not in printable: return False
            text = f.read(bs)
    return True

# Variant 3: read 64k blocks, one regex search per block
def ascii_file3(name):
    bs = 65536
    # matches any character that is not in string.printable
    search = r'[^%s]' % re.escape(printable)
    reco = re.compile(search)
    with open(name, 'rb') as f:
        text = f.read(bs)
        while text:
            if reco.search(text): return False
            text = f.read(bs)
    return True

def test(fun):
    if fun('/tmp/x'):
        print 'is ascii'
    else:
        print 'is not ascii'

cProfile.run("test(ascii_file1)")
cProfile.run("test(ascii_file2)")
cProfile.run("test(ascii_file3)")
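
The script above is Python 2. As a sketch only, the regex variant could look like this on Python 3, where a file opened in 'rb' mode yields bytes and the pattern therefore has to be a bytes pattern too. The function name ascii_file3_py3 and the bs keyword argument are made up for this example.

```python
import re
from string import printable

# Bytes pattern matching any byte that is not in string.printable.
# Compiled once at import time rather than on every call.
_NON_PRINTABLE = re.compile(b'[^' + re.escape(printable.encode('ascii')) + b']')

def ascii_file3_py3(name, bs=65536):
    # Read the file in blocks of bs bytes and run one regex search per block.
    with open(name, 'rb') as f:
        while True:
            block = f.read(bs)
            if not block:
                return True    # reached end of file without a bad byte
            if _NON_PRINTABLE.search(block):
                return False
```

Like the original, this tests against string.printable, which is a subset of ASCII, so a file containing other control characters such as NUL is reported as not ASCII.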




-- 
Wolfgang


