programming unlimited "categories" in python ?

Terry Reedy tjreedy at home.com
Mon Oct 22 19:07:47 EDT 2001


"Stephen" <shriek at gmx.co.uk> wrote in message
news:97ae44ee.0110221318.6eec382d at posting.google.com...
> Scratching my head over what I thought was a simple problem.
>
> I'm developing a catalog application (storing its data in a
> relational database), where catalog entries are categorized
> by one or more categories/subcategories.  It would be nice to
> have "unlimited" levels of subcategories and that seemed
> simple enough ~ "Use a parent/child" I hear you say ~ and that's
> what I did but I've since found it's not very flexible further
> down the line.

This is a database rather than Python question, but since you asked,
I'll answer from my database experience.

> Let me demonstrate with some example categories/subcategories
> which we place in a category tree, giving each node a unique ID.
> The root node is deemed to have node ID zero (0).
>
> Root -- 1. ByLocation --- 3. Africa   --- 4.  Mozambique
>                                       --- 6.  SouthAfrica
>                       --- 5. Europe   --- 9.  Portugal
>      -- 2. BySeverity --- 7. Critical --- 8.  Death
>                                       --- 11. Handicap
>                       --- 10. Moderate---

Location and Severity are two separate variables that should be coded
completely separately from each other.  Severity, if not numeric, is
usually an ordered series of categories, such as 1=mild, 2=moderate,
3=handicap, 4=critical, 5=death.  Coding three-level LOcation is more
complicated.  Either
1. Split it into three variables - continent, country, city, with
continent required and country and city optional -or-
2. Encoded the levels into 'one' entry (which you split into pieces as
needed).  You could try a separater-ed entry such as
continent-country-city (like internet address, windows registry, etc)
or positional-subfield encoded (like many product ids and library book
categorizers (as in either Dewey Decimal or Library of Congress
systems)).  For instance, 0 = Africa, 10 = Mozanbique, 11=(capital
city), 20= Angola, 1000 = Europe, etc.  In otherwords, 1000s =
continent, hundreds/tens = country, ones position = city.

> This structure can be stored in a single table of a database.
>
> Parent_ID    Category_ID     Category_Label
> 0            1               ByLocation
> 0            2               BySeverity
> 1            3               Africa
> 3            4               Mozambique
> 1            5               Europe
> 3            6               SouthAfrica
> 2            7               Critical
> 7            8               Death
> 5            9               Portugal
> etc

Mixing two varriables like this is a mistake.

> So far so good.   Cataloged items/illnesses record their
> categories in a one-to-many table.  For example, an illness
> with categories  "4" and "8" occurs in Mozambique and can
> result in death.
>
> This appears scalable.
>
> Likewise you can easily select all illnesses occuring in
> Portugal using a JOIN and filtering category ID "9".
>
> So what's the problem ?
>
> The problem arises if one asks "Which illnesses occur
> in Africa ?".  First, one has to find all category IDs
> for which this is true. This may be easy for the example
> above but imagine if the category ("Africa") has a sub-
> -category for each and every country then further
> subcategorized by major cities. To find all possible
> categories, one would have to do a recursive select
> finding each subscategory with the appropriate "parent ID".
> This does not seem like a very efficient method.

If location were coded with continent either a separate variable/field
or separate subfield, this query would be trivial, as it should be.

Terry J. Reedy






More information about the Python-list mailing list