
Hi all -- I looked through the bug tracker, but I didn't see this listed. I was trying to use the bz2 codec, but it seems like it's not very useful in its current form (and I'm not sure if it's getting added back to py3k, so maybe this is a moot point).

It looks like the codec writes every piece of data fed to it as a separate compressed block. If you're writing a lot of small bursts of data, this results in compressed files which are significantly larger than the uncompressed files. It also leads to interesting oddities like this:

    import codecs

    with codecs.open('text.bz2', 'w', 'bz2') as f:
        for x in xrange(20):
            f.write('This is data %i\n' % x)

    with codecs.open('text.bz2', 'r', 'bz2') as f:
        print f.read()

This prints "This is data 0" and exits, because the codec won't read beyond the first compressed block.

My question is, is this known, intended behavior? Should I open a bug report? Is it going away in py3k, so there's no real point in fixing it?

-- Chris
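To make the size and read-back problems described above concrete, here is a minimal sketch. It is not from the thread: it assumes Python 3 and uses the bz2 module directly instead of codecs.open(), imitating "one compressed block per write" by calling bz2.compress() on each line. The variable names are made up for illustration.

```python
import bz2

# Twenty short lines, like the loop in the original report.
lines = [("This is data %i\n" % x).encode("ascii") for x in range(20)]
raw = b"".join(lines)

# Assumed model of the reported behavior: every write becomes
# its own complete bz2 stream.
per_write = b"".join(bz2.compress(line) for line in lines)

# The alternative: feed all lines to one compressor, flush once.
compressor = bz2.BZ2Compressor()
single_stream = b"".join(compressor.compress(line) for line in lines)
single_stream += compressor.flush()

# A single decompressor object stops at the end of the first stream
# and leaves everything after it in unused_data.
decompressor = bz2.BZ2Decompressor()
first_only = decompressor.decompress(per_write)

print(len(raw), len(per_write), len(single_stream))
print(repr(first_only), len(decompressor.unused_data))
```

The per-write output is larger than the raw data because each tiny stream carries its own header and trailer, and the single decompressor returns only the first line, which matches the "This is data 0" symptom.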

Chris Bergstresser wrote:
The codec is scheduled to be added back to Python 3. However, its main use is in working on whole chunks of data rather than the line-by-line approach you're after. Line-by-line processing is provided by the codec's incremental encoders/decoders, but these are currently not used by codecs.open(). I'm not sure whether the io lib uses them; if it does, they could be used via the regular open(). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 29 2010)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On Wed, Sep 29, 2010 at 5:23 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Anyway, the obvious way to write line-by-line to a bz2 file is to use the BZ2File class!
The BZ2File class does not allow you to open a file for appending. Using the incremental encoder does work, which leads to the obvious question of why the codecs.open() method doesn't use the incremental method by default, at least in this case. -- Chris
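As a rough sketch of the append-with-incremental-encoder approach mentioned above (not code from the thread), the following assumes Python 3, where the codec is registered as "bz2_codec" and works on bytes. The helper name append_lines and the temporary file path are hypothetical. Reading back relies on bz2.decompress() accepting multi-stream input, which it does from Python 3.3 on.

```python
import bz2
import codecs
import os
import tempfile


def append_lines(path, lines):
    """Append `lines` (an iterable of bytes) to `path` as one new bz2 stream."""
    # One encoder per call, so all lines of this call share a single stream.
    encoder = codecs.getincrementalencoder("bz2_codec")()
    with open(path, "ab") as f:
        for line in lines:
            f.write(encoder.encode(line))
        # final=True flushes the compressor and terminates the stream.
        f.write(encoder.encode(b"", final=True))


# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "text.bz2")
append_lines(path, [b"first run\n"])
append_lines(path, [b"second run, line 1\n", b"second run, line 2\n"])

with open(path, "rb") as f:
    combined = f.read()

# The file now holds two concatenated streams, one per append_lines() call.
text = bz2.decompress(combined)
print(text.decode("ascii"))
```

Each call produces one stream however many lines it writes, so the per-write overhead from the original report is paid once per append session and not once per line.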

On Wed, Sep 29, 2010 at 5:59 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Yes. If you open an existing bz2 file for appending and use the incremental encoder to encode the data you write to it, you end up with a single file containing two separate bz2 compressed streams. The bunzip2 program handles multiple streams in a single file correctly, and there's a bug open (complete with working patch) in the Python tracker to handle them as well. -- Chris
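For readers wondering what handling several bz2 streams in one file looks like in practice, here is a small sketch. It is not the patch from the tracker, just one possible approach, assuming Python 3 and the bz2 module's BZ2Decompressor with its eof and unused_data attributes. The function name decompress_all_streams is hypothetical.

```python
import bz2


def decompress_all_streams(data):
    """Decompress every bz2 stream concatenated in `data` (bytes)."""
    chunks = []
    while data:
        # A decompressor object handles exactly one stream.
        decompressor = bz2.BZ2Decompressor()
        chunks.append(decompressor.decompress(data))
        if not decompressor.eof:
            raise ValueError("truncated bz2 stream")
        # Whatever follows the finished stream starts the next one.
        data = decompressor.unused_data
    return b"".join(chunks)


# Two independent streams glued together, as an append would produce.
two_streams = bz2.compress(b"first stream\n") + bz2.compress(b"second stream\n")
result = decompress_all_streams(two_streams)
print(result.decode("ascii"))
```

The loop starts a fresh decompressor whenever the previous one reports end-of-stream with bytes left over, which is presumably similar in spirit to what bunzip2 does.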