parsing a file for analysis
rmorgan466 at gmail.com
Sat Feb 26 16:58:08 CET 2011
Thanks Andrea. I was thinking that too but I was wondering if there were any
other clever ways of doing this.
I also thought I could build a filesystem structure keyed on __time.
So, for January 01, 2011, I would create /tmp/data/20110101/data. This way
I can have a fast index of the data, and the next time I read through this
file I can skip all of Jan 01, 2011.
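A rough sketch of that per-day layout, under assumptions the log may not meet: I am assuming the timestamp is epoch seconds sitting in a known whitespace-separated column. The names day_key, partition_by_day and time_col are made up for illustration, and the caller passes the root directory (e.g. /tmp/data).

```python
import os
from datetime import datetime, timezone


def day_key(epoch_seconds):
    """Return a YYYYMMDD name for an epoch timestamp (assumed UTC)."""
    stamp = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return stamp.strftime("%Y%m%d")


def partition_by_day(rows, root, time_col=2):
    """Append every INFO row to <root>/<YYYYMMDD>/data.

    time_col is a hypothetical column index; adjust it to the real layout.
    Returns the sorted list of day directories that were written to.
    """
    handles = {}  # day -> open file handle, so each file is opened once
    try:
        for row in rows:
            if "INFO" not in row:
                continue
            fields = row.split()
            if len(fields) <= time_col:
                continue  # skip short or malformed rows
            day = day_key(fields[time_col])
            if day not in handles:
                day_dir = os.path.join(root, day)
                os.makedirs(day_dir, exist_ok=True)
                handles[day] = open(os.path.join(day_dir, "data"), "a")
            handles[day].write(row if row.endswith("\n") else row + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Something like partition_by_day(open(path), "/tmp/data") would do the split in a single pass; later runs could then read only the day directories they care about.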
On Sat, Feb 26, 2011 at 10:29 AM, Andrea Crotti
<andrea.crotti.0 at gmail.com> wrote:
> On 26 Feb 2011, at 06:45, Rita wrote:
> > I have a large text file (4GB) which I am parsing.
> > I am reading the file to collect stats on certain items.
> > My approach has been simple,
> > for row in open(file):
> >     if "INFO" in row:
> >         line=row.split()
> >         user=line
> >         host=line
> >         __time=line
> >         ...
> > I was wondering if there is a framework or a better algorithm to read
> > such a large file and collect its stats according to content. Also, are
> > there any libraries, data structures or functions which can be helpful? I
> > was told about the 'collections' container. Here are some stats I am
> > trying to collect:
> > *Number of unique users
> > *Break down each user's visit according to time, t0 to t1
> > *what user came from what host.
> > *what time had the most users?
> > (There are about 15 different things I want to query)
> > I understand most of these are redundant but it would be nice to have a
> > framework or even an object-oriented way of doing this instead of loading
> > it into a database.
> > Any thoughts or ideas?
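For the stats listed above, the 'collections' module may already cover a lot in one pass. This is only a sketch: the column positions (user_col, host_col, time_col) and the name collect_stats are my own placeholders, not the real log layout.

```python
from collections import Counter, defaultdict


def collect_stats(rows, user_col=0, host_col=1, time_col=2):
    """Gather a few per-user and per-time stats in a single pass.

    The column indexes are hypothetical; adjust them to the real log format.
    """
    visits_per_user = Counter()         # user -> number of INFO rows
    hosts_per_user = defaultdict(set)   # user -> hosts the user came from
    users_per_time = defaultdict(set)   # time value -> users seen at that time
    needed = max(user_col, host_col, time_col)

    for row in rows:
        if "INFO" not in row:
            continue
        fields = row.split()
        if len(fields) <= needed:
            continue  # skip short or malformed rows
        user, host, when = fields[user_col], fields[host_col], fields[time_col]
        visits_per_user[user] += 1
        hosts_per_user[user].add(host)
        users_per_time[when].add(user)

    busiest_time = None
    if users_per_time:
        # the time value with the largest number of distinct users
        busiest_time = max(users_per_time, key=lambda t: len(users_per_time[t]))

    return {
        "unique_users": len(visits_per_user),
        "visits_per_user": visits_per_user,
        "hosts_per_user": hosts_per_user,
        "busiest_time": busiest_time,
    }
```

Each additional query would be one more dict or Counter updated inside the same loop, so the 4GB file is still read only once. Memory grows with the number of distinct users and time values, not with the file size.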
> Not an expert, but maybe it might be good to push the data into a database,
> and then you can tweak the DBMS and write
> smart queries to get all the statistics you want from it.
> It might take a while (splitting with a regexp may be faster), but it's done
> only once and then you work with DB tools.
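If I do go the database route, the standard-library sqlite3 module would probably be enough to try it out. A minimal sketch, assuming the same hypothetical column positions as in my loop; the table name visits and the helper names are invented for illustration.

```python
import sqlite3


def load_rows(conn, rows, user_col=0, host_col=1, time_col=2):
    """Load INFO rows into a 'visits' table (column indexes are assumptions)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visits (username TEXT, host TEXT, ts TEXT)"
    )
    needed = max(user_col, host_col, time_col)

    def records():
        # generator, so the whole file never has to sit in memory
        for row in rows:
            if "INFO" not in row:
                continue
            fields = row.split()
            if len(fields) > needed:
                yield fields[user_col], fields[host_col], fields[time_col]

    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", records())


def unique_users(conn):
    """Number of distinct users."""
    return conn.execute(
        "SELECT COUNT(DISTINCT username) FROM visits"
    ).fetchone()[0]


def busiest_time(conn):
    """Return (ts, distinct user count) for the busiest time value, or None."""
    return conn.execute(
        "SELECT ts, COUNT(DISTINCT username) AS n FROM visits "
        "GROUP BY ts ORDER BY n DESC LIMIT 1"
    ).fetchone()
```

With conn = sqlite3.connect("stats.db") the load happens once, and the remaining queries become SQL against the file. An index on ts or username might help once the table is large.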
--- Get your facts first, then you can distort them as you please.--
More information about the Python-list