using sqlite3 - execute vs. executemany; committing ...

Tue May 6 18:07:14 EDT 2008

Hi David,
thanks for your comments and hints. The proposed approach with a lookup
dict over the list of dicts is indeed much faster than my previous
attempts with a database (even without psyco). I used a slightly different
structure with sets of indices, since they should be unique anyway and the
values are later used for the intersection.
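
A minimal sketch of such a structure, with made-up tag names and values
(not the real data): each tag maps its values to the set of line indices
carrying that value, and a query intersects the sets for the requested
tag values. The helper name find_indices is hypothetical.

```python
# Hypothetical sample data: tag -> value -> set of line indices.
tags_lookups = {
    "chapter": {"1": {0, 1, 2}, "2": {3, 4}},
    "speaker": {"A": {0, 3}, "B": {1, 2, 4}},
}

def find_indices(tags_lookups, **wanted):
    """Return the indices that match all of the given tag=value pairs."""
    # Unknown tags or values contribute an empty set, so the result is empty.
    sets = [tags_lookups.get(tag, {}).get(value, set())
            for tag, value in wanted.items()]
    if not sets:
        return set()
    return set.intersection(*sets)

result = find_indices(tags_lookups, chapter="1", speaker="B")
```

Using sets rather than lists keeps the indices unique and makes the
intersection a single call.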

This way the lookups (of indices for given tag values) seem to be about 20
times faster than the sqlite query (at least for my limited test case;
there might be some peculiarities with the real data). However, the
(visible) code is quite a bit more complex for my taste. Looking back at
the following line in the inner loop of the lookup function:

tags_lookups[tag][item_dict[tag]] = tags_lookups[tag].get(item_dict[tag],
set()) | set([idx])

I wondered whether I am overestimating myself with respect to the future
maintenance of the code ... :-)
I assume that it can most likely be written in a better way, but I tend to
like the simplicity of the sql version, as its speed is fully acceptable
too.
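
One possible simplification of that inner-loop line, as a sketch only: it
assumes the items are a list of dicts walked with enumerate (the
surrounding loop is not shown here), and the function name
build_tags_lookups and the sample items are made up. A nested
collections.defaultdict creates the missing sets on first access, so the
update becomes a plain add().

```python
from collections import defaultdict

def build_tags_lookups(items):
    """Map tag -> value -> set of indices of the items carrying that value."""
    tags_lookups = defaultdict(lambda: defaultdict(set))
    for idx, item_dict in enumerate(items):
        for tag, value in item_dict.items():
            # The empty set is created automatically on first access.
            tags_lookups[tag][value].add(idx)
    return tags_lookups

# Hypothetical sample items, not the real data.
items = [
    {"chapter": "1", "speaker": "A"},
    {"chapter": "1", "speaker": "B"},
    {"chapter": "2", "speaker": "A"},
]
lookups = build_tags_lookups(items)
```

Note that indexing a defaultdict with a missing key inserts that key, so
for read-only queries .get() is the safer call.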

I will have to recheck these approaches as soon as I have more complete
real data available.

Thanks for reminding me about mxTextTools; I looked at this package very
quickly several months ago and it seemed quite complex and heavy-weight,
but maybe I will reconsider this after some investigation ...

The suggested XML structure is actually almost the one I use to prepare
and check the input data before converting it to the one presented in the
previous mail :-). The main problem is that I can't seem to make it fully
valid XML without deforming the structure of the text itself: it can't
easily be decided what CUSTOM_TAG should be in some places, due to the
overlapping etc. Furthermore, the redundancy is actually greater than it
might seem from the sample given here: there are sometimes more tags, some
of them having the same values for several dozen, sometimes hundreds of,
subsequent lines. I also sometimes need to access portions of text spanning
multiple "tags", not just single elements.

Thanks for your time and effort; I'll have to check the alternatives now
and test them a bit further,
  regards,
    Vlasta