data structure suggestion (native python datatypes or sqlite; compound select)

Sat Sep 18 17:00:25 EDT 2010

2010/9/18 Dennis Lee Bieber <wlfraed at ix.netcom.com>:
> On Sat, 18 Sep 2010 10:48:44 +0200, Vlastimil Brom
> <vlastimil.brom at gmail.com> declaimed the following in
> gmane.comp.python.general:
>
>>
>> http://mail.python.org/pipermail/python-list/2008-May/540773.html
>>
>        Ah, based on that listing you are not worried about embedded tags;
> your tags all come at the start of the line (and I presume are
> terminated by the end of line). I'd thought you needed actual positions
> /in/ the line... You can drop the start/end fields and stuff the tag
> attribute into supplement (on SQLite this becomes even simpler since
> even if you define supplement to be integer, SQLite will happily store a
> text value -- a full type checking RDBMS would require either making it a
> text field and storing numeric values as text, or using a pair of fields
> for numeric vs text).
>
>        Tricky part may be how you handle the display markup -- you seem to
> have a <b> </b> split over two lines... Is that significant?
>
>...

>        Of course, all the search terms can be parameterized when
> programming...
>
> cur.execute("""select t.ID, tg.supplement, t.text from texts as t
>                    inner join tags as tg
>                        on tg.textID = t.ID
>                where t.text like ? and tg.type = ?""",
>             ("%den%", "VN"))
>
> results = cur.fetchall()
>
> should return (a Python list of one tuple, in this case):
>
> [(1, "rn_1_vers_1", "<b>wi den L...n")]
> --
>        Wulfraed                 Dennis Lee Bieber         AF6VN
>        wlfraed at ix.netcom.com    HTTP://wlfraed.home.netcom.com/
>
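
For reference, the quoted query can be tried out against an in-memory
SQLite database roughly as follows. This is only a sketch: the table
definitions are not shown in the thread, so the schema and the sample rows
below are assumptions made up for illustration. The supplement column is
declared as integer on purpose, to show that SQLite still stores a text
value in it.

```python
import sqlite3

# Hypothetical schema: only the names used in the quoted query (texts.ID,
# texts.text, tags.textID, tags.type, tags.supplement) come from the thread.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table texts (ID integer primary key, text text)")
cur.execute("""create table tags (
                   ID integer primary key,
                   textID integer references texts(ID),
                   type text,
                   supplement integer)""")

# Invented sample rows, for illustration only.
cur.executemany("insert into texts (ID, text) values (?, ?)",
                [(1, "<b>sample line with den in it"),
                 (2, "another line")])
cur.executemany(
    "insert into tags (textID, type, supplement) values (?, ?, ?)",
    [(1, "VN", "rn_1_vers_1"),   # text value in a column declared integer
     (1, "PG", 12),              # a genuine integer value
     (2, "VN", "rn_1_vers_2")])

# The parameterized join from the quoted message.
cur.execute("""select t.ID, tg.supplement, t.text from texts as t
                   inner join tags as tg
                       on tg.textID = t.ID
               where t.text like ? and tg.type = ?""",
            ("%den%", "VN"))
results = cur.fetchall()
print(results)

# typeof() reports the storage class SQLite actually used for each value.
cur.execute("select supplement, typeof(supplement) from tags order by ID")
storage = cur.fetchall()
print(storage)
conn.close()
```

With these sample rows only the "VN" tag of text 1 matches, and typeof()
reports 'text' for the two string supplements and 'integer' for 12.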

Thank you very much for the detailed hints; I see I should have mentioned
the specification in my initial post...
It is true that nested tags of the same name aren't supported, but
tags may appear anywhere in the text and aren't terminated by a
newline. The tag-value association is valid from the tag position
until the next tag replacing the value, a closing tag (like </b>), or
the end of the text file. Tags at the start of a line are
rather frequent, but they can appear anywhere else too.
I actually only store the metadata in the database - i.e. the
tag-value combinations for the corresponding text indices of the plain
text. The database doesn't currently contain the text itself; plain
text is used for fulltext regexp search, and it should be possible to
find the relevant tags for the matches.
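
To make the rules above concrete, here is a minimal sketch of how the
markup could be split into plain text plus a list of tag-state changes
keyed by plain-text index. The tag syntax is an assumption for the sake of
the example (<name=value> sets a value, <name> sets a bare flag, </name>
clears it); the real format may well differ, and strip_tags is just a
hypothetical helper name.

```python
import re

# Assumed syntax: <name=value>, <name>, </name>. Not the actual format.
TAG_RE = re.compile(r"<(/?)(\w+)(?:=([^>]*))?>")

def strip_tags(marked_up):
    """Return (plain_text, changes).

    changes is a list of (plain_text_index, tag_state) pairs, one for each
    plain-text index at which the set of active tag values changes.
    """
    plain_parts = []
    changes = [(0, {})]      # nothing is active before the first tag
    state = {}
    plain_len = 0
    last_end = 0
    for m in TAG_RE.finditer(marked_up):
        chunk = marked_up[last_end:m.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        last_end = m.end()
        closing, name, value = m.groups()
        if closing:
            state.pop(name, None)          # closing tag ends the association
        else:
            # a new tag of the same name simply replaces the old value
            state[name] = value if value is not None else True
        snapshot = dict(state)
        if changes[-1][0] == plain_len:
            changes[-1] = (plain_len, snapshot)   # several tags at one index
        else:
            changes.append((plain_len, snapshot))
    plain_parts.append(marked_up[last_end:])
    return "".join(plain_parts), changes

plain, changes = strip_tags(
    "<VN=1>first <b>bold</b> verse <VN=2>second verse")
print(plain)
print(changes)
```

Each snapshot copies the whole tag state, which mirrors the current
design where the tag data are repeated for every position with a change.
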
I'll have a closer look at joins in SQL and maybe redesign the data
structure. At present the tag data are copied for each text position with
some tag change, in order to simplify queries; with multiple tables
it could be more efficient to store the tags separately and look them up
individually (probably using bisect, unless there is an SQL
equivalent?).
Well, I still have many areas to investigate in this context ...
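
As a rough sketch of the lookup mentioned above: with the tag changes kept
sorted by text index, bisect finds the last change at or before a match
position, and a plausible SQL counterpart is "pos <= ? order by pos desc
limit 1". The table name tag_changes, its columns and the sample records
are assumptions invented for this example.

```python
import bisect
import sqlite3

# Hypothetical tag-change records: (plain-text index, tag name, value).
changes = [(0, "VN", "1"), (17, "VN", "2"), (40, "VN", "3")]
positions = [pos for pos, _, _ in changes]   # must be sorted ascending

def tag_at(index):
    """Return the change record in effect at a plain-text index."""
    i = bisect.bisect_right(positions, index) - 1
    return changes[i] if i >= 0 else None

conn = sqlite3.connect(":memory:")
conn.execute("create table tag_changes (pos integer, name text, value text)")
conn.execute("create index idx_name_pos on tag_changes (name, pos)")
conn.executemany("insert into tag_changes values (?, ?, ?)", changes)

def tag_at_sql(index, name="VN"):
    """Same lookup in SQL: the last change at or before the index."""
    return conn.execute(
        """select pos, name, value from tag_changes
           where name = ? and pos <= ?
           order by pos desc limit 1""",
        (name, index)).fetchone()

match_index = 23   # e.g. m.start() of a regexp match in the plain text
print(tag_at(match_index))
print(tag_at_sql(match_index))
```

Both calls should give the record starting at index 17 here. With one
sorted list (or one indexed query) per tag name, the tag state would no
longer need to be copied for every position.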

regards,
   Vlastimil Brom