[CentralOH] Anyone uses nteract papermill? Call for opinions on using jupyter notebooks in production
pybokeh at gmail.com
Wed Mar 13 00:46:32 EDT 2019
In my new role, I started to use papermill (admittedly with reservation and
skepticism) to orchestrate the execution of parameterized jupyter notebooks
used to automate my group's data processing needs. The result? I am now
totally blown away and thrilled with papermill (
https://github.com/nteract/papermill). In my previous role, I had used a
jupyter notebook to prototype a data process that automatically creates
invoices. But to put it into production (sadly, I didn't know papermill
existed at the time), I had to port my jupyter notebook code to a data
pipeline framework called Luigi (https://github.com/spotify/luigi).
Now with papermill, I see no need for me to port my jupyter notebooks to
something completely different and thus incur more technical debt from
having to learn yet another framework. With papermill, I define which
jupyter notebooks run and what they depend on (dependency management).
Furthermore, when a failure occurs, I can fix the bug, then run only the
notebooks that didn't run (failure recovery) instead of re-running
everything. The beauty of these features is that I can keep developing my
data pipeline scripts as jupyter notebooks and use them in production.
Also, with papermill, those who don't come from a software engineering or
computer science background can perform practical "data engineering"
without having to learn sophisticated data engineering tools like Apache
Airflow or Spotify Luigi, since papermill is far easier to learn, use, and
get started with than those heavyweight frameworks. Furthermore, a
well-annotated jupyter notebook is much easier to debug and to explain to
a team member. If you've
avoided jupyter notebooks because of the tediousness of version controlling
them or not being able to work with them using your favorite text editor or
IDE, there is now jupytext (https://github.com/mwouts/jupytext) which
allows you to edit notebooks as plain text. I think with all of these
things in place, there is little not to like about using papermill to
execute notebooks as workflows. However, since I am just a data analyst or
"citizen developer", I would like to hear opinions/thoughts from seasoned
IT professionals or data engineers who have worked with jupyter notebooks.
I know of a couple of limitations with papermill: it does not come with a
means of scheduling tasks, and it doesn't come with a fancy GUI or
visualization tool. But I would like to see what others have come up with.
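
To make the "run only the notebooks that didn't run" idea concrete, here is
a minimal sketch of a driver script. It is an illustration under
assumptions, not my actual setup: the run_pipeline function, the JSON state
file, and the notebook names are all hypothetical, and the bookkeeping of
finished notebooks is done by this script rather than by papermill. The
only papermill call used is execute_notebook(input, output, parameters=...).

```python
import json
from pathlib import Path


def run_with_papermill(input_path, output_path, parameters):
    """Execute one notebook with papermill (third-party, imported lazily)."""
    import papermill as pm

    pm.execute_notebook(input_path, output_path, parameters=parameters)


def run_pipeline(steps, state_file, runner=run_with_papermill):
    """Run notebooks in order, skipping those already recorded as finished.

    steps      -- list of (input_path, output_path, parameters) tuples,
                  listed in the order their dependencies require
    state_file -- hypothetical JSON file naming the notebooks that succeeded
    runner     -- callable that executes one notebook and raises on failure
    """
    state_path = Path(state_file)
    done = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    executed = []
    for input_path, output_path, parameters in steps:
        if input_path in done:
            continue  # finished in an earlier run, so do not repeat it
        # An exception here stops the pipeline; later notebooks are not run.
        runner(input_path, output_path, parameters)
        done.add(input_path)
        # Save progress after every success so a rerun can resume from here.
        state_path.write_text(json.dumps(sorted(done)))
        executed.append(input_path)
    return executed
```

Calling run_pipeline again with the same state file after fixing a bug
would skip the notebooks that already succeeded and start from the one that
failed. Deleting the state file forces a full rerun.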
I was inspired to give papermill a spin by Netflix's blog article:
I figured that if papermill and jupyter notebooks were a good fit for
Netflix, they might be a good fit for my uses at work, so I explored them.
So far, I am pleased with the initial results.