Data structure for plotting monotonically expanding data set
Edmondo Giovannozzi
edmondo.giovannozzi at gmail.com
Thu May 27 11:55:11 EDT 2021
On Thursday, 27 May 2021 at 11:28:31 UTC+2, Loris Bennett wrote:
> Hi,
>
> I currently have around 3 years' worth of files like
>
> home.20210527
> home.20210526
> home.20210525
> ...
>
> so around 1000 files, each of which contains information about data
> usage in lines like
>
> name kb
> alice 123
> bob 4
> ...
> zebedee 9999999
>
> (there are actually more columns). I have about 400 users and the
> individual files are around 70 KB in size.
>
> Once a month I want to plot the historical usage as a line graph for the
> whole period for which I have data for each user.
>
> I already have some code to extract the current usage for a single user
> from the most recent file:
>
> regex = re.compile(user)  # compile once, outside the loop
> for line in open(file, "r"):
>     columns = line.split()
>     # need index data_column to exist, so len must exceed it
>     if len(columns) <= data_column:
>         logging.debug("no. of cols.: %i too few for data col", len(columns))
>         continue
>     if regex.match(columns[user_column]):
>         usage = columns[data_column]
>         logging.info(usage)
>         return usage
> logging.error("unable to find %s in %s", user, file)
> return "none"
>
> Obviously I will want to extract all the data for all users from a file
> once I have opened it. After looping over all files I would naively end
> up with, say, a nested dict like
>
> {"20210527": { "alice" : 123, ..., "zebedee": 9999999},
> "20210526": { "alice" : 123, "bob" : 3, ..., "zebedee": 9},
> "20210525": { "alice" : 123, "bob" : 1, ..., "zebedee": 9999999},
> "20210524": { "alice" : 123, ..., "zebedee": 9},
> "20210523": { "alice" : 123, ..., "zebedee": 9999999},
> ...}
>
> where the user keys would vary over time as accounts, such as 'bob', are
> added and later deleted.
>
> Is creating a potentially rather large structure like this the best way
> to go (I obviously could limit the size by, say, only considering the
> last 5 years)? Or is there some better approach for this kind of
> problem? For plotting I would probably use matplotlib.
>
> Cheers,
>
> Loris
>
> --
> This signature is currently under construction.
Have you tried using pandas to read the data?
You could add a column with the date to each file's dataset and then combine the datasets into one.
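
The idea of reading each file with pandas, tagging it with a date and combining the results might be sketched roughly as follows. This is only an illustration under some assumptions that are not confirmed in the thread: each file starts with a header line naming the columns "name" and "kb", the date is the 8-digit suffix of the file name, and each user appears at most once per file. The function names read_usage_file and load_usage are made up for this example.

```python
import re
from pathlib import Path

import pandas as pd


def read_usage_file(path):
    """Read one whitespace-separated usage file and tag its rows with the date."""
    path = Path(path)
    # Assumption: the date is the 8-digit suffix of the name, e.g. home.20210527
    match = re.search(r"\.(\d{8})$", path.name)
    if match is None:
        raise ValueError(f"no date suffix in {path.name}")
    # Assumption: the first line is a header containing "name" and "kb";
    # any further columns are ignored via usecols.
    frame = pd.read_csv(path, sep=r"\s+", usecols=["name", "kb"])
    frame["date"] = pd.to_datetime(match.group(1), format="%Y%m%d")
    return frame


def load_usage(paths):
    """Combine all files into one table: one row per date, one column per user."""
    frames = [read_usage_file(p) for p in paths]
    combined = pd.concat(frames, ignore_index=True)
    # Users missing on a given date (added later or deleted) end up as NaN.
    # pivot raises ValueError if a user appears twice in the same file.
    return combined.pivot(index="date", columns="name", values="kb").sort_index()
```

With something like usage = load_usage(sorted(Path(".").glob("home.*"))), a call such as usage["alice"].plot() or usage.plot() should then draw the line graphs through matplotlib. At roughly 1000 files times 400 users the combined table has a few hundred thousand values, which ought to fit in memory comfortably.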