Yes, yes :-). I was using awk to do all of this. It does work, but I find myself re-reading the same data because awk does not support complex data structures. Plus, the code is getting ugly.<br><br>I was told about Orange (<a href="http://orange.biolab.si/">http://orange.biolab.si/</a>). Does anyone have experience with it?<br>
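<br>A minimal sketch of collecting several of the stats from the quoted post below in a single pass with Python's collections module, so the file is read only once. It assumes the field layout from the quoted loop (user, host and time as the first three whitespace-separated fields of each INFO line); the function name collect_stats and the sample rows are made up for illustration.<br>

```python
from collections import Counter, defaultdict


def collect_stats(rows):
    """Gather several statistics from an iterable of log lines in one pass."""
    users = set()                      # unique users
    hosts_by_user = defaultdict(set)   # user -> hosts that user came from
    users_by_time = defaultdict(set)   # time -> users seen at that time
    visits = Counter()                 # user -> number of INFO lines

    for row in rows:
        if "INFO" not in row:
            continue
        fields = row.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        user, host, when = fields[0], fields[1], fields[2]
        users.add(user)
        hosts_by_user[user].add(host)
        users_by_time[when].add(user)
        visits[user] += 1

    # Time with the most distinct users (None if no INFO line was seen).
    busiest = max(users_by_time, key=lambda t: len(users_by_time[t]), default=None)
    return {
        "unique_users": len(users),
        "hosts_by_user": dict(hosts_by_user),
        "busiest_time": busiest,
        "visits": visits,
    }


# Hypothetical sample rows in the assumed "user host time ..." layout.
sample_rows = [
    "alice h1 10:00 INFO login",
    "bob h2 10:00 INFO login",
    "alice h3 11:00 INFO login",
    "carol h1 10:00 DEBUG noise",
]
stats = collect_stats(sample_rows)
```

With a real file the same function could be fed the open file object directly, since iterating over a file yields one line at a time without loading all 4GB into memory.<br>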

<br><br><br><div class="gmail_quote">On Sat, Feb 26, 2011 at 10:53 AM, Martin Gregorie <span dir="ltr"><martin@address-in-sig.invalid></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">On Sat, 26 Feb 2011 16:29:54 +0100, Andrea Crotti wrote:<br>

<br>

> On 26 Feb 2011, at 06:45, Rita wrote:<br>

><br>

</div><div><div></div><div class="h5">>> I have a large text file (4GB) which I am parsing.<br>

>><br>

>> I am reading the file to collect stats on certain items.<br>

>><br>

>> My approach has been simple,<br>

>><br>

>> for row in open(file):<br>

>>   if "INFO" in row:<br>

>>     line=row.split()<br>

>>     user=line[0]<br>

>>     host=line[1]<br>

>>     __time=line[2]<br>

>>     ...<br>

>><br>

>> I was wondering if there is a framework or a better algorithm to read<br>

>> such a large file and collect its stats according to content. Also, are<br>

>> there any libraries, data structures or functions which can be helpful?<br>

>> I was told about 'collections' container.  Here are some stats I am<br>

>> trying to get:<br>

>><br>

>> *Number of unique users<br>

>> *Break down each user's visit according to time, t0 to t1<br>

>> *what user came from what host.<br>

>> *what time had the most users?<br>

>><br>

>> (There are about 15 different things I want to query)<br>

>><br>

>> I understand most of these are redundant but it would be nice to have a<br>

>> framework or even an object-oriented way of doing this instead of<br>

>> loading it into a database.<br>

>><br>

>><br>

>> Any thoughts or ideas?<br>

><br>

</div></div><div class="im">> Not an expert, but maybe it might be good to push the data into a<br>

> database, and then you can tweak the DBMS and write smart queries to get<br>

> all the statistics you want from it.<br>

><br>

> It might take a while (maybe splitting with a regexp is faster) but it's<br>

> done only once and then you work with DB tools.<br>

><br>
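A rough sketch of that database route, using the sqlite3 module from Python's standard library. The table name visits, the column names and the in-memory database are assumptions for illustration; the fields follow the user/host/time layout from the original post.<br>

```python
import sqlite3


def load_rows(conn, rows):
    """Insert the first three fields of every INFO line into a 'visits' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visits "
        "(username TEXT, hostname TEXT, visit_time TEXT)"
    )
    parsed = (
        tuple(row.split()[:3])
        for row in rows
        if "INFO" in row and len(row.split()) >= 3
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", parsed)
    conn.commit()


# Hypothetical sample rows; a real run would pass the open log file instead.
sample_rows = [
    "alice h1 10:00 INFO login",
    "bob h2 10:00 INFO login",
    "alice h3 11:00 INFO login",
]

conn = sqlite3.connect(":memory:")
load_rows(conn, sample_rows)

# Number of unique users.
unique_users = conn.execute(
    "SELECT COUNT(DISTINCT username) FROM visits"
).fetchone()[0]

# Time with the most distinct users.
busiest_time = conn.execute(
    "SELECT visit_time FROM visits "
    "GROUP BY visit_time ORDER BY COUNT(DISTINCT username) DESC LIMIT 1"
).fetchone()[0]
```

The load happens once; after that each new question is one more SELECT rather than another pass over the file.<br>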

</div>This is the sort of job that is best done with awk.<br>

<br>

Awk processes a text file line by line, automatically splitting each line<br>

into an array of words. It uses regexes to recognise lines and trigger<br>

actions on them. For example, to build a list of visitors, assuming there's<br>

a line containing "username logged on", you could build a list of users<br>

and count their visits with this statement:<br>

<br>

/logged on/ { user[$1] += 1 }<br>

<br>

where the regex, /logged on/, triggers the action, in curly brackets, for<br>

each line it matches. "$1" is a symbol for the first word in the line.<br>
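<br>
For comparison, a minimal Python sketch of the same idea as the awk statement above, using re and collections.Counter. The function name count_logins and the sample lines are made up for illustration.<br>

```python
import re
from collections import Counter

# Same pattern as the awk regex /logged on/.
LOGGED_ON = re.compile(r"logged on")


def count_logins(lines):
    """Count matching lines per first word, like awk's user[$1] += 1."""
    counts = Counter()
    for line in lines:
        if LOGGED_ON.search(line):
            counts[line.split()[0]] += 1  # split()[0] plays the role of $1
    return counts


# Hypothetical sample input.
counts = count_logins([
    "alice logged on",
    "bob logged off",
    "alice logged on again",
    "bob logged on",
])
```

Unlike an awk array, the Counter values could just as well be lists, sets or nested dicts, which is where Python's richer data structures come in.<br>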

<br>

<br>

--<br>

martin@   | Martin Gregorie<br>

gregorie. | Essex, UK<br>

org       |<br>

</blockquote></div><br><br clear="all"><br>-- <br>--- <span>Get your facts first, then you can distort them as you please.</span>--<br>