[Tutor] parse text file
Steven D'Aprano
steve at pearwood.info
Fri Jun 4 01:46:45 CEST 2010
On Fri, 4 Jun 2010 12:45:52 am Colin Talbert wrote:
> I thought when you did a for uline in input_file each single line
> would go into memory independently, not the entire file.
for line in file:
reads one line at a time, but file.read() tries to read everything in
one go. However, it should fail with MemoryError, not just stop
silently.
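To make the difference concrete, here is a minimal sketch of the line-by-line approach. It assumes Python 3 syntax (the thread itself is on Python 2), and the helper name count_lines is hypothetical, not something from your code:

```python
import bz2

def count_lines(path):
    """Count the lines in a bz2 file without holding the whole file in memory."""
    count = 0
    with bz2.BZ2File(path, 'rb') as f:
        for line in f:  # only one decompressed line is held at a time
            count += 1
    return count
```

Calling f.read() instead would build a single object containing the entire decompressed file. Note that iterating this way addresses memory use only; it should not by itself change where BZ2File stops reading.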
> > I'm pretty sure that this is not your code, because you can't call
> > len() on a bz2 file. If you try, you get an error:
>
> You are so correct. I'd been trying numerous things to read in this
> file and had deleted the code that I meant to put here and so wrote
> this from memory incorrectly. The code that I wrote should have
> been:
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> str=input_file.read()
> len(str)
>
> Which indeed does return only 900000.
Unfortunately, I can't download your bz2 file myself to test it, but I
think I *may* have found the problem. It looks like the current bz2
module only supports files written as a single stream, and not multiple
stream files. This is why the BZ2File class has no "append" mode. See
this bug report:
http://bugs.python.org/issue1625
My hypothesis is that your bz2 file consists of either multiple streams,
or multiple bz2 files concatenated together, and the BZ2File class
stops reading after the first.
I can test my hypothesis:
>>> bz2.BZ2File('a.bz2', 'w').write('this is the first chunk of text')
>>> bz2.BZ2File('b.bz2', 'w').write('this is the second chunk of text')
>>> bz2.BZ2File('c.bz2', 'w').write('this is the third chunk of text')
>>> # concatenate the files
... d = file('concate.bz2', 'wb')
>>> for name in "abc":
...     f = file('%c.bz2' % name, 'rb')
...     d.write(f.read())
...
>>> d.close()
>>>
>>> bz2.BZ2File('concate.bz2', 'r').read()
'this is the first chunk of text'
And sure enough, BZ2File only sees the first chunk of text!
But if I open it in a stand-alone bz2 utility (I use the Linux
application Ark), I can see all three chunks of text. So I think we
have a successful test of the hypothesis.
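The same effect can be reproduced entirely in memory with the lower-level bz2.BZ2Decompressor class. This is only a sketch, written in Python 3 syntax as an assumption (the session above is Python 2); the variable names are illustrative:

```python
import bz2

parts = [b'this is the first chunk of text',
         b'this is the second chunk of text',
         b'this is the third chunk of text']

# Compress each chunk separately, then join the compressed streams,
# which mimics concatenating three .bz2 files.
concatenated = b''.join(bz2.compress(p) for p in parts)

# A single decompressor object handles exactly one stream.
decomp = bz2.BZ2Decompressor()
first = decomp.decompress(concatenated)

# Anything after the end of the first stream is not decompressed;
# it is left as raw bytes in unused_data.
leftover = decomp.unused_data
```

Here first should hold only the first chunk of text, and leftover should hold the still-compressed second and third streams, which is consistent with a reader that stops at the end of the first stream.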
Assuming this is the problem you are having, you have a number of
possible solutions:
(1) Re-create the bz2 file from a single stream.
(2) Use another application to expand the bz2 file and then read
directly from that, skipping BZ2File altogether.
(3) Upgrade to Python 2.7 or 3.2, and hope the patch is applied.
(4) Backport the patch to your version of Python and apply it yourself.
(5) Write your own bz2 utility.
Not really a very appetising series of choices there, I must admit.
Probably (1) or (2) are the least worst.
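That said, option (5) need not be a large job if it is built on bz2.BZ2Decompressor. Below is a rough sketch, not a tested solution for your file. It assumes Python 3.3 or later, because it relies on the decompressor's eof attribute; on older versions the end of a stream would have to be detected some other way. The function name iter_multistream and the default chunk size are hypothetical choices, not part of the bz2 module:

```python
import bz2

def iter_multistream(path, chunk_size=1024 * 1024):
    """Yield decompressed blocks from a bz2 file that may hold several streams."""
    decomp = bz2.BZ2Decompressor()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)  # raw compressed bytes
            if not chunk:
                break
            while chunk:
                data = decomp.decompress(chunk)
                if data:
                    yield data
                if decomp.eof:
                    # End of one stream: carry the unread bytes over
                    # to a fresh decompressor for the next stream.
                    chunk = decomp.unused_data
                    decomp = bz2.BZ2Decompressor()
                else:
                    # The whole chunk was consumed; go read more.
                    chunk = b''
```

Because it yields blocks rather than returning one big string, a caller can process a very large file piece by piece, for example by writing each block to an uncompressed output file.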
--
Steven D'Aprano