data structure suggestion (native python datatypes or sqlite; compound select)

MRAB python at mrabarnett.plus.com
Thu Sep 16 19:17:24 EDT 2010


On 16/09/2010 23:11, Vlastimil Brom wrote:
> Hi all,
> I'd like to ask for suggestions regarding suitable datastracture for
> storing textual metadata along with a plain text string.
> The input format I have is simply text with custom tags like<tag_name
> tag_value>; I'd prefer to have this human readable format the original
> data source.
> For the application, the tags are extracted and stored in a custom
> datastructure along with the  plain text.The queries should either
> return the tagset for a given text position (index) or reversely the
> text indices for a given tag-value combination..
> (I posted some more detailed remarks earlier, as I was beginning with this topic
> http://mail.python.org/pipermail/python-list/2007-December/1130275.html
> http://mail.python.org/pipermail/python-list/2008-May/1141958.html
>
> Meanwhile I managed to get working code using python native
> datastructures (nested defaultdicts and sets); however after some time
> I am now planning to adapt this code for a web-based program and am
> considering alternative datastructures, for simplicity I started with
> sqlite (which should actually suffice for the data volume and traffic
> in question).
>
> I put together some code, which works as expected, but I suspect
> somehow, that there must be better ways of doing it.
>
> Two things I am not quite clear about are using the placeholders for
> the data identifiers and "chaining" the SELECT parameters.
>
> I Couldn't find a way to use "?" placeholder for table or column
> names, hence I ended up using string interpolation for them and
> placeholders for the data values, like.
> curs.execute('SELECT * FROM "%s" WHERE "%s"==?' % (text_name,
> index_col), (text_index,))
> is there a better way or is it not supposed to supply these
> identifiers programatically?
>
You would normally expect the structure of the database to be fixed and
only the contents to vary.

> For getting the matching text indices given the tags, tag_values
> combination I ended up with a clumsy query:
>
> combined_query_list = ['SELECT "%s" FROM "%s" WHERE "%s"==?' %
> (index_col, text_name, tag) for tag in tags]
> sql_query = " INTERSECT ".join(combined_query_list)
> curs.execute(sql_query, tag_values)
>
> which produces e.g.:
> SELECT "ind" FROM "n" WHERE "KC"==? INTERSECT SELECT "ind" FROM "n"
> WHERE "VN"==?
>
> or alternatively:
>
> select_begin = 'SELECT "%s" FROM "%s" WHERE ' % (index_col, text_name)
> where_subquery = " AND ".join('"%s"==?' % (tag,) for tag in tags)
> sql_query = select_begin + where_subquery
>
> with the resulting query string like:
> SELECT "ind" FROM "n" WHERE "KC"==? AND "VN"==? ('12', '1')
>
> (BTW, are these queries equivalent, as the outputs suggest, or are
> there some distinctions to be aware of?)
>
> Anyway, I can't really believe, this would be the expected way ...
>
If you're selecting rows of a table then using 'AND' would seem the
obvious way.

> (I only marginally looked into sqlalchemy, which might simplify this a
> bit, is this true? - Currently I only use the standard lib, depending
> on the available server setup (python 2.6)).
>
> Thanks in advance for any suggestions or pointers on both the sql
> usage as well as the general datatype  question.
>
> regards,
>     Vlastimil Brom



More information about the Python-list mailing list