fseek In Compressed Files
Ayushi Dalmia
ayushidalmia2604 at gmail.com
Thu Jan 30 08:34:57 EST 2014
On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
> Hello,
>
> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retrieve the line from the file using the offset? Pointers in this direction will help.
This is what I have done:
import bz2
from random import randint

# Read the posting list and write it back out bzip2-compressed.
with open('temp.txt', 'rb') as f:
    data = f.readlines()

filename = 'temp1.txt.bz2'
with bz2.BZ2File(filename, 'wb', compresslevel=9) as f:
    f.writelines(data)

# Record the uncompressed byte offset at which each line starts,
# keyed by the first field of the line.
prevsize = 0
list1 = []
offset = {}
with bz2.BZ2File(filename, 'rb') as f:
    for line in f:
        words = line.strip().split(b' ')
        list1.append(words[0])
        offset[words[0]] = prevsize
        # len(line) is the number of bytes in the line;
        # sys.getsizeof(line) would also count Python object overhead.
        prevsize += len(line)

# Seek to 20 randomly chosen lines and collect them.
data = []
with bz2.BZ2File(filename, 'rb') as f:
    for count in range(20):
        y = randint(1, 25)
        key = str(y).encode('ascii')
        print(y)
        print(offset[key])
        f.seek(offset[key])
        data.append(f.readline())

with open('b.txt', 'wb') as f:
    f.write(b''.join(data))
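
The offset bookkeeping in the script above can also be done by asking the file object for its position with tell() before each readline(), instead of adding up line sizes by hand. The following is only a minimal sketch: the helper names build_offsets and read_line_at and the three sample lines are made up for illustration. It also prints len() next to sys.getsizeof() for one line, since only len() gives the number of bytes in the uncompressed stream.

```python
import bz2
import os
import sys
import tempfile

def build_offsets(path):
    """Map the first field of each line to the line's uncompressed start offset."""
    offsets = {}
    with bz2.BZ2File(path, 'rb') as f:
        while True:
            start = f.tell()          # position in the uncompressed stream
            line = f.readline()
            if not line:              # empty bytes means end of file
                break
            offsets[line.split(b' ', 1)[0]] = start
    return offsets

def read_line_at(path, start):
    """Seek to an uncompressed offset and return the line that starts there."""
    with bz2.BZ2File(path, 'rb') as f:
        f.seek(start)
        return f.readline()

# Hypothetical sample data in the same shape as temp.txt.
lines = [b'1 456 t0b3c0i0e0\n', b'2 221 t0b1c0i0e0\n', b'3 455 t0b7c0i0e0\n']
path = os.path.join(tempfile.mkdtemp(), 'sample.txt.bz2')
with bz2.BZ2File(path, 'wb') as f:
    f.writelines(lines)

offsets = build_offsets(path)
print(offsets)
print(read_line_at(path, offsets[b'3']))
# Byte length of the line versus memory footprint of the Python object.
print(len(lines[0]), sys.getsizeof(lines[0]))
```

Both values are measured in bytes, but sys.getsizeof() includes the object header, so offsets built from it drift further from the true position with every line.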
Here temp.txt is the posting list file, which is first written in compressed format and then read back. I am trying to build an index for the entire Wikipedia dump, and this needs to be done in a space- and time-efficient way. The contents of temp.txt are as follows:
1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0
2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0
3 455 t0b7c0i0e0:456 t0b1c0i0e0:459 t0b2c0i0e0:669 t0b10c11i3e0:673 t0b1c0i0e0:678 t0b2c0i1e0:854 t0b1c0i0e0
4 410 t0b4c0i0e0:553 t0b1c0i0e0:609 t0b1c0i0e0
5 90 t0b1c0i0e0
6 727 t0b2c0i0e0
7 431 t0b2c0i1e0
8 532 t0b1c0i0e0:652 t0b1c0i0e0:727 t0b2c0i0e0
9 378 t0b1c0i0e0
10 666 t0b2c0i0e0
11 405 t0b1c0i0e0
12 702 t0b1c0i0e0
13 755 t0b1c0i0e0
14 781 t0b1c0i0e0
15 593 t0b1c0i0e0
16 725 t0b1c0i0e0
17 989 t0b2c0i1e0
18 221 t0b1c0i0e0:402 t0b1c0i0e0:842 t0b1c0i0e0
19 405 t0b1c0i0e0
20 200 t0b1c0i0e0:300 t0b1c0i0e0:398 t0b1c0i0e0:649 t0b1c0i0e0
21 66 t0b1c0i0e0
22 30 t0b1c0i0e0
23 126 t0b1c0i0e0:895 t0b1c0i0e0
24 355 t0b1c0i0e0:374 t0b1c0i0e0:378 t0b1c0i0e0:431 t0b3c0i0e0:482 t0b1c0i0e0:546 t0b3c0i0e0:578 t0b1c0i0e0
25 198 t0b1c0i0e0
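
On the time side, the Python documentation notes that seeking in a BZ2File is emulated and may be extremely slow, because the data has to be decompressed up to the requested position. One possible way around this, sketched here under assumptions and not tested on a real dump, is to compress the posting list in independent blocks and store, for each key, where its block starts in the output file, how long the compressed block is, and where the line starts inside the decompressed block. The names write_blocks, lookup and lines_per_block are hypothetical, and the block size would need tuning.

```python
import bz2
import io

def write_blocks(lines, out, lines_per_block=1000):
    """Write lines as independent bz2 blocks.

    Returns a dict: key -> (block_start, block_length, offset_in_block).
    """
    index = {}
    for i in range(0, len(lines), lines_per_block):
        block = lines[i:i + lines_per_block]
        inner = 0
        entries = []
        for line in block:
            entries.append((line.split(b' ', 1)[0], inner))
            inner += len(line)
        compressed = bz2.compress(b''.join(block))
        start = out.tell()            # where this block begins in the output
        out.write(compressed)
        for key, inner_offset in entries:
            index[key] = (start, len(compressed), inner_offset)
    return index

def lookup(f, index, key):
    """Decompress only the block that holds key and return its line."""
    start, length, inner = index[key]
    f.seek(start)                     # cheap seek in the raw file
    raw = bz2.decompress(f.read(length))
    end = raw.find(b'\n', inner)
    return raw[inner:] if end == -1 else raw[inner:end + 1]

# Hypothetical sample data; an in-memory buffer stands in for a file on disk.
lines = [b'1 456 t0b3c0i0e0\n', b'2 221 t0b1c0i0e0\n', b'3 455 t0b7c0i0e0\n',
         b'4 410 t0b4c0i0e0\n', b'5 90 t0b1c0i0e0\n']
buf = io.BytesIO()
index = write_blocks(lines, buf, lines_per_block=2)
print(lookup(buf, index, b'4'))
```

Smaller blocks make each lookup cheaper but compress less well, so the block size trades space against time.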