Running queries on large data structure

Christoph Haas email at christoph-haas.de
Wed Aug 2 16:24:00 EDT 2006


Hi, list...

I wrote an application in Perl some time ago (I was young and needed 
the money) that parses multiple large text files containing nested data 
structures and allows the user to run quick queries on the data.
(For the firewall admins among you: it's a parser and web-based query tool 
for CheckPoint firewall rulebases. The user can search for source and 
destination IPs and get the matching rules.)

The current application consists of two parts:

(1) An importing process that reads and parses the large text files and
    writes the data in different PostgreSQL tables.
(2) A web (CGI) interface that allows the user to query the collected
    data from the PostgreSQL database by different criteria.
    (I don't like PostgreSQL much due to the lack of decent tools like
    phpMyAdmin. pgAdmin3 and phpPgAdmin don't give me the feeling that
    I control the database - it feels more like the other way round.
    But PostgreSQL has a nice 'inet' data type that allows for quick
    matches in tables of IP addresses and networks; a small sketch of
    such a query follows below this list.)
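
For what it's worth, the 'inet' matching could look roughly like this
from Python (just a sketch; psycopg2 is assumed, and the "rules_sources"
table and its columns are made-up names, not the real schema):

    import psycopg2

    conn = psycopg2.connect("dbname=fwrules")
    cur = conn.cursor()
    # ">>=" asks PostgreSQL whether the stored address or network
    # contains the searched IP.
    cur.execute(
        "SELECT rule_id FROM rules_sources WHERE src >>= %s::inet",
        ("10.0.0.5",))
    for (rule_id,) in cur.fetchall():
        print(rule_id)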

However, the information in the (relational) database is stored in a 
horribly artificial way. The SQL query is a 20-line monster with UNIONs 
and LEFT JOINs and negations. It's lightning fast (0.5 seconds to search 
a set of 500 complex rules) but neither the source code nor the database 
is easy to handle any more. And I'd like to have more flexibility in the 
kind of queries I run. So I'd like to trade some of that speed for 
readability and a simpler - more object-oriented - data structure.

I'm currently thinking of different ways to handle this and would like to 
hear some opinions:

(a) See what sqlalchemy can do for me to handle the object-relational
    mapping and basically stay with PostgreSQL.
(b) Parse the input files into one large nested Python data structure.
    Then write this structure to a file using "marshal" or "repr".
    Then I have very clean source code like

        for rule in rules:
            for src in rule["sources"]:
                if searched_for_src == src:
                    ...

    (a fuller sketch follows after this list)
(c) ...?
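
Option (b) could be as small as something like this (just a sketch; the
rule layout, key names and file name are invented for illustration):

    import marshal

    rules = [
        {"name": "allow-web",
         "sources": ["10.0.0.5", "10.0.0.6"],
         "destinations": ["192.168.42.1"]},
    ]

    # The import process dumps the parsed structure once...
    with open("rulebase.marshal", "wb") as f:
        marshal.dump(rules, f)

    # ...and the CGI script loads it again for in-memory queries.
    with open("rulebase.marshal", "rb") as f:
        rules = marshal.load(f)

    searched_for_src = "10.0.0.5"
    hits = [rule for rule in rules if searched_for_src in rule["sources"]]
    print(hits)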

What makes PostgreSQL less suited is that CheckPoint rule bases can 
contain several kinds of complex objects (a matching sketch in plain 
Python follows this list):
- Hosts (easy, they are just one IP address and can easily be compared)
- Networks (nearly as easy - just see if the IP is part of the network)
- Groups (slightly harder; can even be nested and contain other groups
  and hosts or networks)
- IP ranges (10.0.0.50-10.5.25.100; not easy to parse either)
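
Matching a searched IP against these object types is doable in plain
Python, though. Just a sketch - the object layout (dicts with a "type"
key) is invented here, not CheckPoint's real format:

    import socket, struct

    def ip_to_int(ip):
        """Turn a dotted-quad string into a comparable integer."""
        return struct.unpack("!I", socket.inet_aton(ip))[0]

    def matches(obj, ip):
        n = ip_to_int(ip)
        if obj["type"] == "host":
            return n == ip_to_int(obj["address"])
        if obj["type"] == "network":        # e.g. "10.0.0.0/8"
            net, bits = obj["network"].split("/")
            mask = (0xffffffff << (32 - int(bits))) & 0xffffffff
            return n & mask == ip_to_int(net) & mask
        if obj["type"] == "range":          # e.g. 10.0.0.50-10.5.25.100
            return ip_to_int(obj["first"]) <= n <= ip_to_int(obj["last"])
        if obj["type"] == "group":          # groups may nest, so recurse
            return any(matches(member, ip) for member in obj["members"])
        return False

    internal = {"type": "group", "name": "internal hosts", "members": [
        {"type": "network", "network": "10.0.0.0/8"},
        {"type": "range", "first": "192.168.42.1",
         "last": "192.168.42.100"},
    ]}
    print(matches(internal, "10.0.0.5"))    # -> True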

I would even like to allow the users more complex queries with multiple 
search conditions. A query would be something like "show me all matching
firewall rules where 10.0.0.5 matches the source column and 192.168.42.1
matches the destination column, OR any rule where the group 'internal hosts' 
is mentioned in the destination column". That sounds like a job for a 
database. But somehow a database is also not flexible enough. And the 
data is small enough (probably 1 MB) that it can be read into memory.
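
If the rules live in memory, those combined conditions could just be
plain Python expressions or small predicate functions. Another sketch,
assuming the invented matches() helper from above and rules whose
source/destination columns hold such object dicts:

    def src_matches(ip):
        return lambda rule: any(matches(o, ip) for o in rule["sources"])

    def dst_matches(ip):
        return lambda rule: any(matches(o, ip)
                                for o in rule["destinations"])

    def dst_mentions_group(name):
        return lambda rule: any(o.get("name") == name
                                for o in rule["destinations"])

    # "10.0.0.5 in source AND 192.168.42.1 in destination, OR the group
    #  'internal hosts' mentioned in the destination column"
    def wanted(rule):
        return ((src_matches("10.0.0.5")(rule)
                 and dst_matches("192.168.42.1")(rule))
                or dst_mentions_group("internal hosts")(rule))

    hits = [rule for rule in rules if wanted(rule)]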

What do you think would be the right tool for the job? Thanks for sharing 
your thoughts.

 Christoph


