data structure suggestion (native python datatypes or sqlite; compound select)

Vlastimil Brom vlastimil.brom at gmail.com
Sat Sep 18 04:48:44 EDT 2010


2010/9/18 Dennis Lee Bieber <wlfraed at ix.netcom.com>:
> On Fri, 17 Sep 2010 10:44:43 +0200, Vlastimil Brom
> <vlastimil.brom at gmail.com> declaimed the following in
> gmane.comp.python.general:
>
>
>> Ok, thanks for confirming my suspicion :-),
>> Now I have to decide whether I shall use my custom data structure,
>> where I am on my own, or whether using an sql database in such a
>> non-standard way has some advantages...
>> The main problem indeed seems to be the fact, that I consider the
>> tagged texts to be the primary storage format, whereas the database is
>> only means for accessing it more conveniently.
>>
>
>        I suspect part of your difficulty is in trying to fit everything
> into a single relation (table).
>
>        Looking back at your ancient "format for storing textual data (for
> an edition) - formatting and additional info" post, I'd probably move
> your so-called tags into one relation -- where the tag type is, itself,
> data...
>
>        Without seeing an actual data sample (and pseudo-DDL):
>
> create table texts
>        (
>                ID autoincrement primary key,
>                text varchar
>        );
>
> create table tags
>        (
>                ID autoincrement primary key,
>                textID integer foreign key references texts(ID),
>                tagtype char,
>                start integer,
>                end integer,
>                supplement varchar
>        );
>
>        I'd really have to see samples (more than one line) of the raw
> input, and the desired information...
>
> --
>        Wulfraed                 Dennis Lee Bieber         AF6VN
>        wlfraed at ix.netcom.com    HTTP://wlfraed.home.netcom.com/
>

Thanks for the elaboration. I am sure I am missing some of the more
advanced features of SQL. (Would the above also work in sqlite, given
that there are (probably?) no real type restrictions on data?)
The "markup" format as well as the requirements haven't changed since
those old posts; one sample of the tagged text is in one of the
follow-up posts to that thread:

http://mail.python.org/pipermail/python-list/2008-May/540773.html
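
Regarding the sqlite question above, here is a minimal sketch of how the
quoted pseudo-DDL might be written for sqlite through Python's sqlite3
module. sqlite treats declared column types as affinities rather than
strict constraints, and an INTEGER PRIMARY KEY column is filled in
automatically. The column names start_idx and end_idx (renamed because
END is an SQL keyword), the half-open range convention, the two helper
functions and the sample rows are assumptions made for illustration;
they are not taken from the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE texts (
    id   INTEGER PRIMARY KEY,          -- assigned automatically by sqlite
    text TEXT
);
CREATE TABLE tags (
    id         INTEGER PRIMARY KEY,
    text_id    INTEGER REFERENCES texts(id),  -- enforced only with PRAGMA foreign_keys = ON
    tagtype    TEXT,                   -- the tag name is itself data
    start_idx  INTEGER,                -- first text index covered
    end_idx    INTEGER,                -- one past the last index covered (assumption)
    supplement TEXT                    -- the tag value
);
""")

# Invented sample data: one ten-character text with three tag ranges.
text_id = conn.execute(
    "INSERT INTO texts (text) VALUES (?)", ("abcdefghij",)
).lastrowid
conn.executemany(
    "INSERT INTO tags (text_id, tagtype, start_idx, end_idx, supplement) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (text_id, "chapter", 0, 10, "1"),
        (text_id, "page", 0, 4, "1"),
        (text_id, "page", 4, 10, "2"),
    ],
)


def tagset_at(conn, text_id, index):
    """Return {tag name: value} for all tags covering the given text index."""
    rows = conn.execute(
        "SELECT tagtype, supplement FROM tags "
        "WHERE text_id = ? AND start_idx <= ? AND ? < end_idx",
        (text_id, index, index),
    )
    return dict(rows)


def ranges_for(conn, text_id, tagtype, value):
    """Return the (start, end) ranges carrying the given tag-value combination."""
    return conn.execute(
        "SELECT start_idx, end_idx FROM tags "
        "WHERE text_id = ? AND tagtype = ? AND supplement = ? "
        "ORDER BY start_idx",
        (text_id, tagtype, value),
    ).fetchall()
```

Because the tag name is stored as a value in the tagtype column, texts
with differing tags would not need differing queries in this layout.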

In principle, the tags are of the form <tag_name some tag value>;
from that text index on, this tag-value combination is assigned, until
<tag_name another value> or </tag_name> is reached. Arbitrary
combinations of the tags, including overlapping, are possible (nesting
of the same tag is not possible; a new value directly replaces the
previous one).
Different texts may have (partly) differing tags, which I'd prefer to
handle generically, without having to adapt the queries directly.
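
To make the tag semantics described above concrete, here is a rough
parsing sketch. It assumes that tag names consist of word characters,
that the value is everything after the first whitespace, and that the
text itself contains no literal "<" or ">"; the function name
parse_tagged and the sample string are invented for illustration and
are not taken from the linked sample.

```python
import re

# <name value> opens or replaces a tag, </name> closes it.
TAG_RE = re.compile(r"<(/?)(\w+)(?:\s+([^<>]*))?>")


def parse_tagged(tagged):
    """Split tagged text into plain text and (name, value, start, end) ranges.

    Indices refer to the plain text; end is exclusive.
    """
    plain_parts = []
    ranges = []
    open_tags = {}   # tag name -> (value, start index in the plain text)
    pos = 0          # current length of the plain text
    last = 0         # current position in the tagged text
    for m in TAG_RE.finditer(tagged):
        chunk = tagged[last:m.start()]
        plain_parts.append(chunk)
        pos += len(chunk)
        last = m.end()
        closing, name, value = m.group(1), m.group(2), m.group(3)
        # Both a closing tag and a new value end the currently open range.
        if name in open_tags:
            old_value, start = open_tags.pop(name)
            if pos > start:
                ranges.append((name, old_value, start, pos))
        if not closing:
            open_tags[name] = ((value or "").strip(), pos)
    tail = tagged[last:]
    plain_parts.append(tail)
    pos += len(tail)
    # Tags still open at the end of the text run to the end.
    for name, (value, start) in open_tags.items():
        if pos > start:
            ranges.append((name, value, start, pos))
    return "".join(plain_parts), ranges


plain, ranges = parse_tagged("<page 1><chapter A>abc<page 2>de</chapter>fg")
```

In this invented sample the page and chapter ranges overlap, while the
second page tag simply replaces the first one.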

After the tagged text is parsed, the plain text and the corresponding
"database" are created; the latter maps the text indices to the tag
names with their values.
A query should be able to get the "tagset" for a given text index and,
conversely, to find the indices matching given tag-value combinations.
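
With native Python datatypes, the two lookups described above might be
sketched as follows. The range tuples (name, value, start, end) with an
exclusive end, the sample data and the function names are assumptions
for illustration only; the sketch relies on the rule that ranges of the
same tag never overlap.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical sample: (tag name, value, start, end), end exclusive.
RANGES = [("page", "1", 0, 3), ("chapter", "A", 0, 5), ("page", "2", 3, 7)]


def build_index(ranges):
    """Group ranges by tag name as sorted (start, end, value) tuples."""
    by_tag = defaultdict(list)
    for name, value, start, end in ranges:
        by_tag[name].append((start, end, value))
    for spans in by_tag.values():
        spans.sort()
    return dict(by_tag)


def tagset_at(index_by_tag, pos):
    """Return {tag name: value} for every tag covering text index pos."""
    result = {}
    for name, spans in index_by_tag.items():
        # Find the rightmost span starting at or before pos.
        i = bisect_right(spans, (pos, float("inf"), "")) - 1
        if i >= 0 and spans[i][1] > pos:
            result[name] = spans[i][2]
    return result


def indices_matching(index_by_tag, **criteria):
    """Return the sorted text indices where all tag=value criteria hold."""
    result = None
    for name, value in criteria.items():
        hits = set()
        for start, end, span_value in index_by_tag.get(name, []):
            if span_value == value:
                hits.update(range(start, end))
        result = hits if result is None else result & hits
    return sorted(result or ())


index_by_tag = build_index(RANGES)
```

For example, tagset_at(index_by_tag, 4) collects the values of both
overlapping tags, and indices_matching(index_by_tag, page="2",
chapter="A") intersects the indices of the two combinations.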

(Actually, the text ranges matching those criteria would be even
better, but these are easily obtained with bisect.)
(Judging from its specification, mxTextTools looks similar, but it
seemed rather low-level and quite heavyweight for the given task.)

Thanks in advance for any suggestions,
                    Vlastimil Brom


