[Tutor] Reading large bz2 Files

Norman Rieß norman at smash-net.org
Fri Feb 19 19:48:34 CET 2010


On 19.02.2010 17:04, Steven D'Aprano wrote:
> My guess is one of two things:
> (1) You are mistaken that the file is bigger than 4311 lines.
>
> (2) You are using Windows, and somehow there is a Ctrl-Z (0x1A) 
> character in the file, which Windows interprets as End Of File when 
> reading files in text mode. Try changing the mode to "rb" and see if 
> the behaviour goes away.
>   

On 19.02.2010 17:15, Stefan Behnel wrote:
> What does "stops" mean here? Does it crash? Does it exit from the loop? Is
> the above code exactly what you used for testing? Are you passing a
> filename? What platform is this on?
>
>
> How many lines does it have? How did you count them? Did you make sure that
> you are reading from the right file?
>
>   

Hello,

I took the liberty of merging your mails, so I do not have to
repeat things.
Here is how big the file is and how I counted:

smash@loki ~/osm $ bzcat planet-100210.osm.bz2 | wc -l
1717362770
(this took a looong time ;-))
smash@loki ~/osm $ du -h planet-100210.osm.bz2
8,0G    planet-100210.osm.bz2

So as you can see, the file really is bigger than 4311 lines.
I am not using Windows, and the next character in the file would be a
period: the output below stops in the middle of max_lon="25.3238275".

smash@loki ~/osm/osmcut $ ./osmcut.py ../planet-100210.osm.bz2
[...]
<changeset id="4307" created_at="2006-04-21T16:07:41Z"
closed_at="2006-04-21T17:42:48Z" open="false" min_lon="-0.0603664"
min_lat="51.6146756" max_lon="-0.0339018" max_lat="51.6451527" 
user="Steve Chilton" uid="736"/>
<changeset id="4308" created_at="2008-04-01T07:31:41Z"
closed_at="2008-04-01T08:33:05Z" open="false" min_lon="25.1998022"
min_lat="67.5300900" max_lon="25
Exiting
I used file: ../planet-100210.osm.bz2
smash@loki ~/osm $

smash@loki ~/osm $ bzcat planet-100210.osm.bz2 | grep "changeset id=\"4308\""
  <changeset id="4308" created_at="2008-04-01T07:31:41Z"
closed_at="2008-04-01T08:33:05Z" open="false" min_lon="25.1998022"
min_lat="67.5300900" max_lon="25.3238275" max_lat="67.5653612" 
user="Kekoil" uid="19652"/>

I did set the mode to "rb", with the same result.
I also edited the code to see whether the loop exited or the program crashed.
As you can see, there is no error; the loop just exits.
This is the _exact_ code I use:

source_file = bz2.BZ2File(osm_file, "r")
for line in source_file:
    print line.strip()

print "Exiting"
print "I used file: " + osm_file

As you can see above, the loop exits, the prints are executed and the
right file is used. The content of the file is really distinctive, so
there is no doubt that it is the right file.
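
To pin down where the read ends, a small diagnostic along these lines
might help. It is only a sketch: the function name summarize_read is
made up for illustration, and it is written so that it should run on
Python 2.6 as well as on Python 3. It counts the lines and decompressed
bytes the loop actually sees and checks whether the last line ended
with a newline. A last line without one, as in the output above, would
suggest that the decompressed data ends in the middle of a line rather
than the loop stopping at a line boundary.

```python
import bz2


def summarize_read(path):
    """Return (line_count, byte_count, last_line_complete) for a bz2 file.

    Reads the file the same way as the loop above, but only keeps counters
    and the last line instead of printing everything.
    """
    line_count = 0
    byte_count = 0
    last_line = b""
    # Binary mode, so no newline or end-of-file translation can interfere.
    source_file = bz2.BZ2File(path, "rb")
    try:
        for line in source_file:
            line_count += 1
            byte_count += len(line)
            last_line = line
    finally:
        # Explicit close instead of "with", which BZ2File lacks on Python 2.6.
        source_file.close()
    return line_count, byte_count, last_line.endswith(b"\n")
```

Calling summarize_read("../planet-100210.osm.bz2") and comparing the
line count with the wc -l result would show how much of the file the
loop really covers.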
Here is my platform information:
Python 2.6.4
Linux 2.6.32.8 #1 SMP Fri Feb 12 13:29:10 CET 2010 x86_64 Intel(R)
Core(TM)2 Duo CPU U9400 @ 1.40GHz GenuineIntel GNU/Linux
Note: This symptom shows up on another platform (SuSE 11.1) with different
software versions as well.

Is it possible that the bz2 module reads only into a limited buffer and
no further? That would explain why two independent systems behave the
same way and why Steven's smaller example works.
How could I avoid that?
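
One way to test this would be to bypass BZ2File and feed the raw bytes
to bz2.BZ2Decompressor in chunks. The sketch below rests on an
assumption that nothing above confirms: that the archive might consist
of several bz2 streams concatenated into one file. A decompressor
object stops at the end of its stream and leaves any remaining bytes in
unused_data, so the sketch starts a new decompressor whenever that
happens. The function name count_streams_and_lines and the 1 MiB chunk
size are made up for illustration.

```python
import bz2


def count_streams_and_lines(path, chunk_size=1024 * 1024):
    """Return (stream_count, line_count) for a possibly multi-stream bz2 file."""
    streams = 0
    lines = 0
    decomp = None
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            while chunk:
                if decomp is None:
                    # Start decoding a new bz2 stream.
                    decomp = bz2.BZ2Decompressor()
                    streams += 1
                try:
                    data = decomp.decompress(chunk)
                except EOFError:
                    # The previous stream ended exactly at a chunk boundary,
                    # so this chunk belongs to a new stream. Retry it.
                    decomp = None
                    continue
                lines += data.count(b"\n")
                # Bytes after the end of the stream, if any, are left here.
                chunk = decomp.unused_data
                if chunk:
                    decomp = None
    return streams, lines
```

If this reported more than one stream and a line count matching the
wc -l figure, the truncation would come from reading only the first
stream rather than from a buffer limit. If it reported a single stream,
the assumption would be wrong and the cause would lie elsewhere.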

Oh, and the content of the file is free, so I do not get into legal
issues by exposing it.

Thanks.
Regards,

Norman


