parsing a file for analysis
Martin Gregorie
martin at address-in-sig.invalid
Sat Feb 26 10:53:51 EST 2011
On Sat, 26 Feb 2011 16:29:54 +0100, Andrea Crotti wrote:
> On 26 Feb 2011, at 06:45, Rita wrote:
>
>> I have a large text (4GB) which I am parsing.
>>
>> I am reading the file to collect stats on certain items.
>>
>> My approach has been simple,
>>
>> for row in open(file):
>>     if "INFO" in row:
>>         line = row.split()
>>         user = line[0]
>>         host = line[1]
>>         __time = line[2]
>>         ...
>>
>> I was wondering if there is a framework or a better algorithm to read
>> such a large file and collect its stats according to content. Also, are
>> there any libraries, data structures or functions which can be helpful?
>> I was told about 'collections' container. Here are some stats I am
>> trying to get:
>>
>> *Number of unique users
>> *Break down each user's visit according to time, t0 to t1
>> *what user came from what host.
>> *what time had the most users?
>>
>> (There are about 15 different things I want to query)
>>
>> I understand most of these are redundant but it would be nice to have a
>> framework or even an object-oriented way of doing this instead of
>> loading it into a database.
>>
>>
>> Any thoughts or ideas?
>
> I'm not an expert, but it might be good to push the data into a
> database, and then you can tweak the DBMS and write smart queries to get
> all the statistics you want from it.
>
> It might take a while (maybe splitting with a regexp is faster) but it's
> done only once and then you work with DB tools.
>
This is the sort of job that is best done with awk.
Awk processes a text file line by line, automatically splitting each line
into an array of words. It uses regexes to recognise lines and trigger
actions on them. For example, to build a list of visitors, assuming each
login produces a line containing "username logged on", you could build a
list of users and count their visits with this statement:
    /logged on/ { user[$1] += 1 }
where the regex, /logged on/, triggers the action, in curly brackets, for
each line it matches. "$1" is a symbol for the first word in the line.
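
If you would rather stay in Python, the same idea can be sketched roughly as
below with collections.Counter. This is only an illustration: the function
name count_visits and the sample lines are made up, and it assumes, like the
awk example, that the user name is the first word on each "logged on" line.

```python
import re
from collections import Counter

# Same pattern as the awk rule /logged on/
LOGGED_ON = re.compile(r"logged on")

def count_visits(lines):
    """Count visits per user, like: /logged on/ { user[$1] += 1 }"""
    visits = Counter()
    for line in lines:
        if LOGGED_ON.search(line):
            # Assumption: the first whitespace-separated word is the user name.
            visits[line.split()[0]] += 1
    return visits

# Hypothetical sample lines; the real log format may differ.
sample = [
    "alice logged on from host1",
    "bob logged on from host2",
    "alice logged on from host3",
    "carol logged off",
]

visits = count_visits(sample)
print(len(visits), "unique users")
for user, n in visits.most_common():
    print(user, n)
```

Since count_visits only iterates over its argument, you should be able to
pass it an open file object and have the 4GB file read one line at a time
rather than loaded into memory.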
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |