os walk() and threads problems (os.walk are thread safe?)
Marcus Alves Grando
marcus at sbh.eng.br
Tue Nov 13 15:55:31 EST 2007
I made a new version that is closer to the original:
--code--
#!/usr/bin/python

import os, sys, time
import glob, random, Queue
import threading

EXIT = False
BRANDS = {}
LOCK = threading.Lock()
EV = threading.Event()
POOL = Queue.Queue(0)
NRO_THREADS = 20

def walkerr(err):
    print err

class Worker(threading.Thread):
    def run(self):
        # Wait until the main thread has filled the queue.
        EV.wait()
        while True:
            try:
                mydir = POOL.get(timeout=1)
                if mydir is None:
                    continue

                for root, dirs, files in os.walk(mydir, onerror=walkerr):
                    if EXIT:
                        break

                    # Fixed test values.
                    terra_user = 'test'
                    terra_brand = 'test'
                    user_du = '0 a'
                    user_total_files = 0

                    # Update the shared totals once per directory visited.
                    LOCK.acquire()
                    if terra_brand not in BRANDS:
                        BRANDS[terra_brand] = {}
                        BRANDS[terra_brand]['COUNT'] = 1
                        BRANDS[terra_brand]['SIZE'] = int(user_du.split()[0])
                        BRANDS[terra_brand]['FILES'] = user_total_files
                    else:
                        BRANDS[terra_brand]['COUNT'] += 1
                        BRANDS[terra_brand]['SIZE'] += int(user_du.split()[0])
                        BRANDS[terra_brand]['FILES'] += user_total_files
                    LOCK.release()
            except Queue.Empty:
                if EXIT:
                    break
                else:
                    continue
            except KeyboardInterrupt:
                break
            except Exception:
                print mydir
                raise

if len(sys.argv) < 2:
    print 'Usage: %s dir...' % sys.argv[0]
    sys.exit(1)

glob_dirs = []
for i in sys.argv[1:]:
    glob_dirs = glob_dirs + glob.glob(i + '/[a-z_]*')
random.shuffle(glob_dirs)

for x in xrange(NRO_THREADS):
    Worker().start()

try:
    for i in glob_dirs:
        POOL.put(i)

    EV.set()
    while not POOL.empty():
        time.sleep(1)
    EXIT = True

    while threading.activeCount() > 1:
        time.sleep(1)
except KeyboardInterrupt:
    EXIT = True

for b in BRANDS:
    print '%s:%i:%i:%i' % (b, BRANDS[b]['SIZE'], BRANDS[b]['COUNT'],
                           BRANDS[b]['FILES'])
--code--
And I ran it on several servers:
# uname -r
2.6.18-8.1.15.el5
# python test.py /usr
test:0:2267:0
# python test.py /usr
test:0:2224:0
# python test.py /usr
test:0:2380:0
# python -V
Python 2.4.3
# uname -r
7.0-BETA2
# python test.py /usr
test:0:1706:0
# python test.py /usr
test:0:1492:0
# python test.py /usr
test:0:1524:0
# python -V
Python 2.5.1
# uname -r
2.6.9-42.0.8.ELsmp
# python test.py /usr
test:0:1311:0
# python test.py /usr
test:0:1486:0
# python test.py /usr
test:0:1520:0
# python -V
Python 2.3.4
I really don't know what is happening.
Any other ideas?
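
One possibility worth ruling out: `POOL.empty()` becomes true as soon as the last directory has been taken off the queue, not when the walk of that directory has finished. Since the workers leave the `os.walk()` loop once `EXIT` is set, that could explain counts that vary from run to run. Below is a minimal sketch of waiting for the work itself to finish with `task_done()` and `join()`. It assumes Python 3 (module `queue` instead of `Queue`), and the names `count_dirs` and `worker` are made up for illustration; it is not the script above.

```python
import os
import queue
import threading

def count_dirs(top_dirs, n_threads=4):
    """Walk every directory in top_dirs with a pool of threads and
    return the total number of directories visited."""
    pool = queue.Queue()
    lock = threading.Lock()
    total = [0]  # mutable holder so the workers can update it

    def worker():
        while True:
            mydir = pool.get()
            if mydir is None:        # sentinel: no more work for this thread
                pool.task_done()
                return
            try:
                n = sum(1 for _root, _dirs, _files in os.walk(mydir))
                with lock:
                    total[0] += n
            finally:
                pool.task_done()     # mark the item done only after the walk

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for d in top_dirs:
        pool.put(d)
    for _ in threads:
        pool.put(None)               # one sentinel per worker
    pool.join()                      # returns once every item was task_done()
    for t in threads:
        t.join()
    return total[0]
```

With this shape no exit flag is needed: a walk is never interrupted, and the main thread only continues once every queued directory has been fully processed.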
Regards
Chris Mellon wrote:
> On Nov 13, 2007 1:06 PM, Marcus Alves Grando <marcus at sbh.eng.br> wrote:
>> Diez B. Roggisch wrote:
>>> Marcus Alves Grando wrote:
>>>
>>>> Diez B. Roggisch wrote:
>>>>> Marcus Alves Grando wrote:
>>>>>
>>>>>> Hello list,
>>>>>>
>>>>>> I have a strange problem with os.walk and threads in a Python script. I
>>>>>> have a script that creates some threads that consume a Queue. For every
>>>>>> value in the Queue the script runs os.walk() and prints the root dir. But
>>>>>> if I increase the number of threads, the results are inconsistent compared
>>>>>> with one thread.
>>>>>>
>>>>>> For example, run this code with one thread and sort the output, then run
>>>>>> it again with ten threads and compare the two with diff(1).
>>>>> I don't see any difference. I ran it with 1 and 10 workers + sorted the
>>>>> output. No diff whatsoever.
>>>> Did you test in a dir with many subdirs? Like /usr or /usr/ports (in
>>>> FreeBSD), for example?
>>> Yes, over 1000 subdirs/files.
>> Strange, because for me it occurs every time.
>>
>>>>> And I don't know what you mean by diff(1) - was that supposed to be some
>>>>> output?
>>>> No. One thread produces one result and ten threads produce another result
>>>> with fewer lines.
>>>>
>>>> See the example below:
>>>>
>>>> @@ -13774,8 +13782,6 @@
>>>> /usr/compat/linux/proc/44
>>>> /usr/compat/linux/proc/45
>>>> /usr/compat/linux/proc/45318
>>>> -/usr/compat/linux/proc/45484
>>>> -/usr/compat/linux/proc/45532
>>>> /usr/compat/linux/proc/45857
>>>> /usr/compat/linux/proc/45903
>>>> /usr/compat/linux/proc/46
>>> I'm not sure what that directory is, but to me that looks like the
>>> Linux /proc dir, containing process ids. Which incidentally changes
>>> between the two runs, as more threads will have process id aliases.
>> My example was not good enough. I ran this script on the ports directory
>> in FreeBSD and on IMAP folders on my Linux server; same thing.
>>
>> @@ -182,7 +220,6 @@
>> /usr/ports/archivers/p5-POE-Filter-Bzip2
>> /usr/ports/archivers/p5-POE-Filter-LZF
>> /usr/ports/archivers/p5-POE-Filter-LZO
>> -/usr/ports/archivers/p5-POE-Filter-LZW
>> /usr/ports/archivers/p5-POE-Filter-Zlib
>> /usr/ports/archivers/p5-PerlIO-gzip
>> /usr/ports/archivers/p5-PerlIO-via-Bzip2
>> @@ -234,7 +271,6 @@
>> /usr/ports/archivers/star-devel
>> /usr/ports/archivers/star-devel/files
>> /usr/ports/archivers/star/files
>> -/usr/ports/archivers/stuffit
>> /usr/ports/archivers/szip
>> /usr/ports/archivers/tardy
>> /usr/ports/archivers/tardy/files
>>
>>
>
> Are you just diffing the output? There's no guarantee that
> os.walk() will always have the same order, or that your different
> working threads will produce the same output in the same order. On my
> system, for example, I get a different order of subdirectory output
> when I run with 10 threads than with 1.
>
> walk() requires that stat() works for the next directory that will be
> walked. It might be remotely possible that stat() is failing for some
> reason and some directories are being lost (this is probably not going
> to be reproducible). If you can reproduce it, try using pdb to see
> what's going on inside walk().
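
To take output order out of the picture, as raised above, the two runs can be compared as sets instead of diffing text. The following is a minimal sketch, assuming Python 3; `walked_dirs` and `compare_runs` are hypothetical helper names, not part of the script in this thread. It also passes `onerror`, because by default `os.walk()` ignores directories it fails to list.

```python
import os

def walked_dirs(top):
    """Return the set of directory paths os.walk() visits under top."""
    found = set()

    def on_error(err):
        # Without onerror, os.walk() silently skips directories it cannot list.
        print("walk error:", err)

    for root, _dirs, _files in os.walk(top, onerror=on_error):
        found.add(root)
    return found

def compare_runs(first, second):
    """Return (missing_from_second, extra_in_second) as sorted lists."""
    return sorted(first - second), sorted(second - first)
```

If both returned lists are empty, the two runs visited the same directories regardless of the order in which they were printed.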
--
Marcus Alves Grando
marcus(at)sbh.eng.br | Personal
mnag(at)FreeBSD.org | FreeBSD.org