Need to know if a file has only ASCII characters
Wolfgang Rohdewald
wolfgang at rohdewald.de
Wed Jun 17 05:40:31 EDT 2009
On Wednesday 17 June 2009, Lie Ryan wrote:
> Wolfgang Rohdewald wrote:
> > On Wednesday, 17. June 2009, Steven D'Aprano wrote:
> >> while text:
> >>     for c in text:
> >>         if c not in printable: return False
> >
> > that is one loop per character.
>
> unless printable is a set
that would still execute the line "if c not in..."
once for every single character, against just one
regex call. With bigger block sizes, the advantage
of regex should increase.
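
A minimal sketch of that difference, for one block already in memory. It assumes Python 3 and bytes input (the script in this post is Python 2), and the names PRINTABLE_SET, NON_PRINTABLE, block_ok_loop and block_ok_regex are made up for illustration:

```python
import re
from string import printable

# Assumption: Python 3, so file data read in 'rb' mode is bytes.
PRINTABLE = printable.encode('ascii')
PRINTABLE_SET = frozenset(PRINTABLE)  # iterating bytes yields ints
NON_PRINTABLE = re.compile(b'[^' + re.escape(PRINTABLE) + b']')


def block_ok_loop(block):
    # Even with a set, the membership test runs once per byte
    # in the interpreter loop.
    for c in block:
        if c not in PRINTABLE_SET:
            return False
    return True


def block_ok_regex(block):
    # One call per block; the scan over the bytes happens
    # inside the regex engine.
    return NON_PRINTABLE.search(block) is None
```

Both functions should give the same answer for any block; only the number of Python-level operations differs. Timing them on the same block with timeit would be one way to check where the crossover lies for small inputs.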
> > wouldn't it be faster to apply a regex to text?
> > something like
> >
> > while text:
> >     if re.search(r'\W',text): return False
> >
>
> regex? Don't even start...
Here comes a cProfile test. Note that Steven's first variant
would always have stopped after the first char. After fixing that,
which makes it look like variant 2 with a block size of 1, I now have
3 variants:
Variant 1 Blocksize 1
Variant 2 Blocksize 65536
Variant 3 Regex on Blocksize 65536
Testing with a 400k-byte file shows regex as the clear winner (output below).
Doing the same for an 8k file: variant 2 takes 3ms, regex takes 5ms.
Variants 2 and 3 take about the same time for a 20k file.
python ascii.py | grep CPU
398202 function calls in 1.597 CPU seconds
13 function calls in 0.104 CPU seconds
1181 function calls in 0.012 CPU seconds
import re
import cProfile
from string import printable

# Variant 1: read and test one byte at a time
def ascii_file1(name):
    with open(name, 'rb') as f:
        c = f.read(1)
        while c:
            if c not in printable: return False
            c = f.read(1)
    return True

# Variant 2: read 64k blocks, test each character in a Python loop
def ascii_file2(name):
    bs = 65536
    with open(name, 'rb') as f:
        text = f.read(bs)
        while text:
            for c in text:
                if c not in printable: return False
            text = f.read(bs)
    return True

# Variant 3: read 64k blocks, one regex search per block
def ascii_file3(name):
    bs = 65536
    search = r'[^%s]' % re.escape(printable)
    reco = re.compile(search)
    with open(name, 'rb') as f:
        text = f.read(bs)
        while text:
            if reco.search(text): return False
            text = f.read(bs)
    return True

def test(fun):
    if fun('/tmp/x'):
        print 'is ascii'
    else:
        print 'is not ascii'

cProfile.run("test(ascii_file1)")
cProfile.run("test(ascii_file2)")
cProfile.run("test(ascii_file3)")
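
The script above is Python 2, where a file opened with 'rb' returns str. Under Python 3 the same read returns bytes, so the pattern has to be a bytes pattern as well. A rough sketch of variant 3 under that assumption follows; the name ascii_file3_py3 and the bs parameter are made up here, and this is not the code that was profiled:

```python
import re
from string import printable

# Assumption: Python 3. The pattern is built from string.printable as bytes.
_NON_PRINTABLE = re.compile(
    b'[^' + re.escape(printable.encode('ascii')) + b']')


def ascii_file3_py3(name, bs=65536):
    """Return True if every byte of the file is in string.printable."""
    with open(name, 'rb') as f:
        # iter() with a sentinel stops at the empty bytes object
        # that read() returns at end of file.
        for block in iter(lambda: f.read(bs), b''):
            if _NON_PRINTABLE.search(block):
                return False
    return True
```

Like the original, this tests against string.printable, which is a subset of ASCII: control characters such as NUL are rejected even though they are ASCII.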
--
Wolfgang