On Jun 14, 2013, at 11:18 PM, Christopher Armstrong <radix@twistedmatrix.com> wrote:

This conversation has gotten pretty sprawling; time to reel it in with
some code.

What do you think of this for an API that meets in the middle?

https://gist.github.com/radeex/5787124

This example implementation only concerns itself with the points under
debate right now; obviously it's completely unusable in general. But
anyway:

1. it supports English logs
2. it doesn't require you to specify a formatting if you want to just
log a bunch of data
3. it makes it easy to specify a system (both manually and based on
the module your code is in)

First off, thanks again for taking the time to correspond about this, I'm very happy for this conversation to be attracting attention and making progress :).

This code is relying on the semantics of the existing log.msg() call (which are bad, for the reasons I enumerated in the start of this thread), right down to the dependence on %-formatting. The main thing I'm interested in at the moment is actually fixing those semantics to be more useful; the specific thing we're talking about here, persistent log identifiers, is interesting to me at the moment mostly as it relates to that.

I realize you might just be relying on it for familiarity, you might not have had time to read the whole new thing, so I'm not trying to tear down your idea based on that alone :).

But Wilfredo already implemented a thing which addresses the namespacing-of-log-messaages issue, and it does about 2/3 of what your proposal already does (except it has, like, tests and stuff). Right down to the _getframe() hack (although <http://trac.calendarserver.org/browser/CalendarServer/trunk/twext/python/log.py#L312> uses inspect.currentframe() which has a public-sounding name at least :)). As I mentioned in my immediately-previous reply, the logger's 'namespace' does not need to be the same as your module name, so it can be a "more persistent" identifier if you have one.

More significantly, you're actually going to the trouble of constructing a format string. This produces output in an ad-hoc format: '='-separated str()s of your log keys (although only the ones you pass directly in your message; those annotating or augmenting the message at subsequent points by other publishers / observers are ignored), which you are presumably relying upon being smooshed into a plain-text log file, right along side many other messages which do not fit this format. Presumably this could lead to issues like traceback lines with '=' assignments in them showing up in whatever kind of ad-hoc log-analysis you want to do later, and ambiguous unparseable results if any fields in the log message have a '=' and a space in their repr.

Let's not make it the responsibility of the emitter, or the format string, to get those values into a text log file that can be scanned with a regex (albeit in the case of your proposal, a much-simplified regex). Let's push that responsibility down into the log observer, and actually save it as structured data and *parse* it later.

As all the hip young kids will tell you, this type of data should be persisted with JSON, but we should probably have code to support multiple formats. Another advantage of serializing and parsing the logs as structured data would allow tools written to work on a "live" log observer to also analyze persistent logs in at least a similar way, if not entirely the same way.

(An object that has survived the grim ordeal of translation to and from JSON will surely be made less by the process, and can't be treated quite the same way by a log observer; but, as I'm sure you'll agree, this is probably better than the alternative of Pickle. An unpickled object will be brought fully back to life, but, in the words of the recent NIN release, it will have "came back haunted".)

Since it's also important to be able to *filter* logs, and avoid the work of formatting or serializing them at all, it would be best to push this work as far down the line as possible. The construction of the format string is itself a whole ton of string-copying, the sort of thing I'd really like to avoid. (Last I measured, dictionary operations that don't cause a reallocation - and most don't, unless you're inserting in a long loop - are cheaper than even apparently small string copies.)

In fact, it strikes me that perhaps *this* is the feature we should be talking about implementing: writing a structured log observer and parser, firming up the contract between emitters and observers so it's clear what to do with data that can't be serialized - more than exactly which fields we use to identify messages that are interesting to us.

Relatedly, we could make log analysis a first-class task; instead of assuming someone will grep the log file later, actually provide a simple log-query tool that users can plug their code into, presuming they're using a log format which is cleanly parseable. Maybe have a little web resource that serves up nicely formatted views of your logs.

I can implement the UUID thing as a completely separate ticket and we can discuss it on its own merits; meanwhile, you can happily add log IDs to everything in your applications. I'd be quite happy to table this part of the conversation for a separate ticket; I just thought it would be a nice thing that could fall out of some of the other parts of a new logging system.