[Tutor] parse text file

Fri Jun 4 01:46:45 CEST 2010

On Fri, 4 Jun 2010 12:45:52 am Colin Talbert wrote:

> I thought when you did a for uline in input_file each single line
> would go into memory independently, not the entire file.

for line in file:

reads one line at a time, but file.read() tries to read everything in 
one go. However, if the file were too big to fit in memory, read() 
should fail with MemoryError, not just stop silently.
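
To make the difference concrete, here is a minimal sketch in Python 3 
syntax. The function names count_lines and read_whole are made up for 
illustration; they are not from your code.

```python
def count_lines(path):
    """Count lines by iterating, so only one line is handled at a time."""
    total = 0
    with open(path, 'rb') as f:
        for line in f:  # the file object hands back one line per iteration
            total += 1
    return total


def read_whole(path):
    """Return the entire file contents as a single bytes object."""
    with open(path, 'rb') as f:
        return f.read()  # needs enough memory for the whole file at once
```

count_lines() only ever deals with one line (plus a read buffer), while 
read_whole() needs room for the entire file.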

> > I'm pretty sure that this is not your code, because you can't call
> > len() on a bz2 file. If you try, you get an error:
>
> You are so correct.  I'd been trying numerous things to read in this
> file and had deleted the code that I meant to put here and so wrote
> this from memory incorrectly.  The code that I wrote should have
> been:
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> str=input_file.read()
> len(str)
>
> Which indeed does return only 900000.

Unfortunately, I can't download your bz2 file myself to test it, but I 
think I *may* have found the problem. It looks like the current bz2 
module only supports files written as a single stream, and not multiple 
stream files. This is why the BZ2File class has no "append" mode. See 
this bug report:

http://bugs.python.org/issue1625

My hypothesis is that your bz2 file consists of either multiple streams, 
or multiple bz2 files concatenated together, and the BZ2File class 
stops reading after the first.

I can test my hypothesis:

>>> bz2.BZ2File('a.bz2', 'w').write('this is the first chunk of text')
>>> bz2.BZ2File('b.bz2', 'w').write('this is the second chunk of text')
>>> bz2.BZ2File('c.bz2', 'w').write('this is the third chunk of text')
>>> # concatenate the files
... d = file('concate.bz2', 'wb')
>>> for name in "abc":
...     f = file('%c.bz2' % name, 'rb')
...     d.write(f.read())
...
>>> d.close()
>>>
>>> bz2.BZ2File('concate.bz2', 'r').read()
'this is the first chunk of text'

And sure enough, BZ2File only sees the first chunk of text!

But if I open it in a stand-alone bz2 utility (I use the Linux 
application Ark), I can see all three chunks of text. So I think we 
have a successful test of the hypothesis.
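
The same behaviour can be seen without touching the disk. This is only a 
sketch, written in Python 3 syntax, and the variable names are mine. A 
single BZ2Decompressor stops at the end of the first stream and leaves 
everything after it in its unused_data attribute:

```python
import bz2

first = b'this is the first chunk of text'
second = b'this is the second chunk of text'
third = b'this is the third chunk of text'

# Three independent bz2 streams, glued together like concate.bz2 above.
streams = [bz2.compress(text) for text in (first, second, third)]
concatenated = b''.join(streams)

decomp = bz2.BZ2Decompressor()
seen = decomp.decompress(concatenated)  # decodes only the first stream
leftover = decomp.unused_data           # raw bytes of streams two and three
```

Here seen holds just the first chunk of text, and leftover still holds 
the compressed bytes of the other two streams.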

Assuming this is the problem you are having, you have a number of 
possible solutions:

(1) Re-create the bz2 file from a single stream.

(2) Use another application to expand the bz2 file and then read 
directly from the expanded file, skipping BZ2File altogether.

(3) Upgrade to Python 2.7 or 3.2, and hope the patch is applied.

(4) Backport the patch to your version of Python and apply it yourself.

(5) Write your own bz2 utility.

Not really a very appetising series of choices there, I must admit. 
Probably (1) or (2) are the least worst.
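
For what it's worth, (5) need not be a big job if all you want is to read 
the data. Here is a rough sketch in Python 3 syntax, not tested against 
your planet file: start a fresh BZ2Decompressor each time one stream 
ends, and feed it whatever the previous one left in unused_data. The 
name iter_decompressed and the chunk_size default are my own inventions, 
and the eof attribute is assumed to be available (Python 3.3 or later).

```python
import bz2


def iter_decompressed(path, chunk_size=1024 * 1024):
    """Yield decompressed blocks from a bz2 file holding one or more streams."""
    decomp = bz2.BZ2Decompressor()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of the compressed file
            while chunk:
                data = decomp.decompress(chunk)
                if data:
                    yield data
                if decomp.eof:
                    # This stream is finished; the rest of the chunk (if any)
                    # belongs to the next stream, so start a new decompressor.
                    chunk = decomp.unused_data
                    decomp = bz2.BZ2Decompressor()
                else:
                    # The whole chunk was consumed by the current stream.
                    chunk = b''
```

It yields blocks rather than lines, so you would still have to split on 
newlines yourself, but only one block at a time is held in memory. 
Trailing bytes that are not a valid bz2 stream would make decompress() 
raise an error.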

-- 
Steven D'Aprano