Fwd: [Python-Dev] An yocto change proposal in logging module to simplify structured logs support

Hi Python-Ideas ML,

To summarize the idea quickly: I would like to add an "extra" attribute to LogRecord, to make structured log generation easier. For more details, with a use case and an example, you can read the message below. Before pushing the patch to bugs.python.org, I'm interested in your opinions: the patch seems almost too simple to be honest.

Regards.
--
Ludovic Gasc (GMLudo)
http://www.gmludo.eu/

---------- Forwarded message ----------
From: Guido van Rossum <guido@python.org>
Date: 2015-05-24 23:44 GMT+02:00
Subject: Re: [Python-Dev] An yocto change proposal in logging module to simplify structured logs support
To: Ludovic Gasc <gmludo@gmail.com>

Ehh, python-ideas?

On Sun, May 24, 2015 at 10:22 AM, Ludovic Gasc <gmludo@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Mon, May 25, 2015 at 12:26:33AM +0200, Ludovic Gasc wrote:
The documentation for the logging module already includes a recipe for simple structured logging:
https://docs.python.org/2/howto/logging-cookbook.html#implementing-structure...

At the other extreme, there is the structlog module:
https://structlog.readthedocs.org/en/stable/

How does your change compare to those?

-- Steve
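[For reference, the cookbook recipe mentioned above works roughly like this; a paraphrased sketch, not the exact cookbook code:]

    import json
    import logging

    class StructuredMessage:
        """Wrap a message plus keyword data; render the data as JSON."""
        def __init__(self, message, **kwargs):
            self.message = message
            self.kwargs = kwargs

        def __str__(self):
            return '%s >>> %s' % (self.message, json.dumps(self.kwargs))

    logging.basicConfig(level=logging.INFO, format='%(message)s')
    logging.info(StructuredMessage('user login', user='alice', ip='10.0.0.1'))
    # -> user login >>> {"user": "alice", "ip": "10.0.0.1"}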

2015-05-25 4:19 GMT+02:00 Steven D'Aprano <steve@pearwood.info>:
If I understand this recipe correctly, it "only" standardizes the log message content: it isn't really sysadmin-friendly to read, and, more importantly, you still have to parse the logs and build a database of structured entries in order to query them. When you have more than 400 log messages per second on a single server, rebuilding that data structure is not a negligible cost, compared to pushing structured data directly onto the wire in a form your structured-log daemon understands natively.
At the other extreme, there is the structlog module:
Thank you for the link; it's an interesting project, like the "logging" module on steroids, with some good logging ideas inside. However, if I understand correctly, it takes the same approach as the previous recipe: generate a log file with JSON content, use logstash-forwarder to re-parse that JSON, and finally send the structure to logstash for the query part: https://structlog.readthedocs.org/en/stable/standard-library.html#suggested-...
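[For reference, the structlog approach described here looks roughly like the sketch below; it assumes structlog's documented basics of that era (configure, JSONRenderer, get_logger, bind), and the exact output ordering may differ:]

    import structlog

    # Render each event dict as one JSON line, ready to be shipped to
    # logstash-forwarder or any log collector that reads JSON lines.
    structlog.configure(processors=[structlog.processors.JSONRenderer()])

    log = structlog.get_logger()
    log = log.bind(account_id=42)
    log.info('create_or_update', payload={'name': 'Alice'})
    # prints something like:
    # {"account_id": 42, "payload": {"name": "Alice"}, "event": "create_or_update"}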
How does your change compare to those?
In the use case of structlog, drop the logstash-forwarder step and connect the Python daemon directly to the structured-log daemon. Even if logstash-forwarder is efficient, why have an additional step to rebuild a structure you already had at the beginning? It's certainly possible to monkey-patch or override the logging module to get this behaviour; nevertheless, it would be cleaner to have it integrated directly in Python. Moreover, with the existing "extra" parameter, 99% of the work is already done in Python; my addition only keeps the list of metadata explicit in the LogRecord. The nice-to-have, at least for me, is that the extra dict should also be usable to format the message string, to avoid passing the same information twice. If this discussion doesn't raise blocking remarks, I'll send a patch on bugs.python.org. Regards.
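[For illustration, the duplication described above with the existing "extra" mechanism looks roughly like this; a minimal sketch using only the standard library:]

    import logging

    logging.basicConfig(format='%(asctime)s %(account_id)s %(message)s')
    logger = logging.getLogger('api')

    account_id = 42
    # Metadata passed via "extra" is merged into the LogRecord as attributes,
    # so a Formatter or a custom handler can reach it as record.account_id.
    # But to also show the value in the message text, it must be passed twice:
    logger.warning('Receive a create_or_update request from "%s" account',
                   account_id, extra={'account_id': account_id})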

On Monday, May 25, 2015 6:57 AM, Ludovic Gasc <gmludo@gmail.com> wrote:
2015-05-25 4:19 GMT+02:00 Steven D'Aprano <steve@pearwood.info>:
However, in fact, if I understand correctly, it's the same approach that the previous recipe: Generate a log file with JSON content, use logstash-forwarder to reparse the JSON content, to finally send the structure to logstash, for the query part: https://structlog.readthedocs.org/en/stable/standard-library.html#suggested-...
How does your change compare to those?
In the use case of structlog, drop the logstash-forwarder step to interconnect directly Python daemon with structured log daemon.
Even if logstash-forwarder should be efficient, why to have an additional step to rebuild a structure you have at the beginning ?
You can't send a Python dictionary over the wire, or store a Python dictionary in a database. You need to encode it to some transmission and/or storage format; there's no way around that. And what's wrong with using JSON as that format?

More importantly, when you drop logstash-forwarder, how are you intending to get the messages to the upstream server? You don't want to make your log calls synchronously wait for acknowledgement before returning. So you need some kind of buffering. And just buffering in memory doesn't work: if your service shuts down unexpectedly, you've lost the last batch of log messages which would tell you why it went down (plus, if the network goes down temporarily, your memory use becomes unbounded). You can of course buffer to disk, but then you've just reintroduced the same need for some kind of intermediate storage format you were trying to eliminate—and it doesn't really solve the problem, because if your service shuts down, the last messages won't get sent until it starts up again. So you could write a separate simple store-and-forward daemon that either reads those file buffers or listens on localhost UDP… but then you've just recreated logstash-forwarder.

And even if you wanted to do all that, I don't see why you couldn't do it all with structlog. They recommend using an already-working workflow instead of designing a different one from scratch, but it's just a recommendation.
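[For illustration, encoding records as JSON for the wire needs nothing more than a custom Formatter; this is a hypothetical sketch, the JSONFormatter name and field choice are illustrative:]

    import json
    import logging

    class JSONFormatter(logging.Formatter):
        """Render each record as one JSON object per line."""
        def format(self, record):
            return json.dumps({
                'logger': record.name,
                'level': record.levelname,
                'message': record.getMessage(),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())
    root = logging.getLogger()
    root.addHandler(handler)
    root.warning('disk %s is almost full', '/dev/sda1')
    # {"logger": "root", "level": "WARNING", "message": "disk /dev/sda1 is almost full"}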

2015-05-25 22:08 GMT+02:00 Andrew Barnert <abarnert@yahoo.com>:
Sorry for the delay; I have been very busy for the past month.
Maybe I should be clearer about my objective: I'm trying to build the simplest possible logging architecture, based on the existing Python logging features (because all Python libraries use them), with features similar to ELK (Elasticsearch, Logstash, Kibana).

On paper, the ELK features are very interesting for a sysadmin and, with my sysadmin hat on, I strongly agree with that. The issue, based on my experience where I work (sorry, eventual ELK lovers on this ML), is that it's very complicated to set up, to maintain, and to keep scalable when you have a lot of logs: we spent a lot of time getting a working ELK, and finally dropped it because the maintenance cost was too high for us compared to using grep on rsyslog logs.

Maybe we aren't smart enough to maintain ELK, it's possible. However, if we're not smart enough to do it, certainly some people have the same issue as us. In fact, the issue shouldn't be our brains; it was clearly a time-consuming task, and we have too much directly paid work to take care of. Don't get me wrong: I don't say that ELK doesn't work, only that it's time-consuming with a high volume of logs. I'm pretty sure a lot of people are happy with ELK, and that's cool for them ;-)

It's like Oracle and PostgreSQL databases: where with Oracle you need a full-time DBA, with PostgreSQL: apt-get install postgresql. With this last sentence I'm totally caricatural, but only to show where I see an issue that should be fixed, at least for us. (FYI, in a previous professional life I maintained Oracle, MySQL and PostgreSQL servers for several clients, so I know the subject a little bit.)
More importantly, when you drop logstash-forwarder, how are you intending
In the past, we used a local rsyslog directly to play this role on each VM, connected to the Python daemons via a datagram UNIX socket. See a logging config file example: https://github.com/Eyepea/cookiecutter-API-Hour/blob/master/%7B%7Bcookiecutt... Now it's journald that plays this role, also via a datagram UNIX socket.
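[For illustration, a minimal configuration of that shape could look like the following; a hypothetical sketch, not the linked cookiecutter file, using only the standard library:]

    import logging
    import logging.config

    LOGGING = {
        'version': 1,
        'handlers': {
            'syslog': {
                'class': 'logging.handlers.SysLogHandler',
                'address': '/dev/log',   # local datagram UNIX socket (rsyslog/journald)
            },
        },
        'root': {
            'handlers': ['syslog'],
            'level': 'INFO',
        },
    }

    logging.config.dictConfig(LOGGING)
    logging.getLogger('api').info('daemon started')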
You're right: don't reinvent the wheel. However, if I follow your argument in another context: instead of creating asyncio, Guido should have integrated Twisted into Python? As an end-user of Twisted and asyncio, it isn't for the pleasure of being fancy that we migrated from Twisted to asyncio ;-) To me, the expression should be: "Don't reinvent the wheel, except if you can provide a more efficient wheel."

Now, in the context of logging: please let me try another approach; maybe I'll waste my time, or maybe I'll find an interesting gold nugget, who knows before digging? You may think I'm only trying to be different from the "ELK" standard, and it's possible, who knows? If I revive this thread, it isn't to troll you, but because I'm interested in your opinion: I may have found a better approach that doesn't need a CPython patch and is more powerful.

In the source code of the logging package, I've found this:
https://github.com/python/cpython/blob/master/Lib/logging/__init__.py#L269

BTW, this approach deserves more promotion: I didn't know you could pass a dict to fill in the placeholders of a log message; I thought only positional values worked. Now, instead of using the extra parameter, I use this feature directly. Instead of writing this:

    logger.debug('Receive a create_or_update request from "%s" account', account_id)

the developer writes this:

    logger.debug('Receive a create_or_update request from "%(account_id)s" account',
                 {'request_id': request.request_id,
                  'account_id': account_id,
                  'aiohttp_request': request,
                  'payload': payload})

With that, you can write logs as usual in your source code and use whatever handler you want. However, if you use the systemDream handler, all the metadata attached to your log will be sent to journald:
https://github.com/Eyepea/systemDream/blob/master/src/systemdream/journal/ha...

Another bonus of this approach is that you can use an element of your dict to improve the log message itself; with my previous approach based on the extra parameter, you had to pass the values twice. The cherry on the cake is that extra remains available for something else. And the bonus of bonuses: developers who already use this logging feature are already journald-compliant without knowing it. I see no drawbacks to this approach, except that developers who already use this feature must be consistent in the key names of the dict for it to be useful with journald.

I'm very interested in your feedback; maybe I've missed something. If nobody finds an issue, I'll also push this pattern into the official Python binding for journald; systemDream is only my laboratory to experiment around systemd/journald (and, secondarily, it's impossible to install the official Python binding of systemd/journald in a pyvenv, at least for me). I'll also publish a step-by-step tutorial for newcomers on my blog.

Thanks for your attention.
--
Ludovic Gasc (GMLudo)
http://www.gmludo.eu/
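[For illustration, a handler following this pattern could look roughly like the sketch below; it is not the actual systemDream code, and it assumes the python-systemd binding (systemd.journal.send) is importable:]

    import logging
    from systemd import journal   # python-systemd binding, assumed available

    class JournalDictHandler(logging.Handler):
        """Illustrative sketch only -- not the actual systemDream handler.

        When a log call passes a dict as its args (the pattern shown above),
        forward every key/value pair to journald as a custom field, next to
        the already-formatted MESSAGE."""

        def emit(self, record):
            fields = {}
            if isinstance(record.args, dict):
                # journald field names are upper-case identifiers
                fields = {key.upper(): str(value)
                          for key, value in record.args.items()}
            journal.send(self.format(record), **fields)

    logger = logging.getLogger('api')
    logger.addHandler(JournalDictHandler())
    logger.setLevel(logging.DEBUG)
    logger.debug('Receive a create_or_update request from "%(account_id)s" account',
                 {'account_id': 42, 'request_id': 'abc-123'})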

On 29 Jun 2015 1:50 am, "Ludovic Gasc" <gmludo@gmail.com> wrote:
It's like Oracle and PostgreSQL databases: where with Oracle you need a full-time DBA, with PostgreSQL: apt-get install postgresql
This discrepancy in manageability between services like PostgreSQL & more complex setups like the ELK stack is why Red Hat started working on Nulecule as part of Project Atomic: http://rhelblog.redhat.com/2015/06/23/announcing-yum-rpm-for-containerized-a... There's still some work to be done making sure the related tools support Debian and derivatives properly, but "the ELK stack is too hard to install & maintain" is a distro level software management problem to be solved, rather than something to try to work around at the language level. Cheers, Nick.
From my point of view, the features in journald are enough to replace most usages of ELK, at least for us, and, contrary to ELK, journald is already installed in all recent Linux distributions, even Debian Jessie. You have almost no maintenance cost.

2015-06-29 3:09 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
Thanks for the link, I didn't know about it; I'll keep an eye on it.
As you can imagine, I want to finish digging into journald before changing strategy. Thanks for your time and your remarks. Have a nice day.

Participants (4):
- Andrew Barnert
- Ludovic Gasc
- Nick Coghlan
- Steven D'Aprano