Beyond Jupyter: what’s in a notebook?
Yesterday I participated (as a visitor) in the kickoff meeting for OpenDreamKit, where one recurrent topic of discussion was notebooks, both Jupyter and Sage, including the question of whether they could be brought together. This reminded me of a recent blog post by Kirill Pomogajko entitled “Why I don’t like Jupyter”. And it reminded me of my own long-term project of integrating Jupyter with my ActivePapers system for reproducible research. That’s three reasons for writing down my thoughts about notebooks and their role(s) in computational research, so here we go.
One key observation is in Gaël Varoquaux’s comment on Kirill’s blog post: using Jupyter for doing science creates a lock-in, because all collaborators on a project must agree on using Jupyter. There is no other tool that can be used productively for working with notebooks. It’s a case of “wordization”: digital content is taken hostage by a tool that defines a storage format for its own convenience, without much consideration for other tools, be they competing or complementary. Wordization not only restricts the users’ freedom to work with their data, but also creates headaches for the future. A data format defined by a tool can easily become unusable as the tool evolves and introduces incompatibilities, or of course if the tool disappears. In the case of Jupyter, its developers have always provided upgrade paths for notebooks between versions, but at some point this is bound to create trouble. Bugs are a fact of life, and I don’t expect the version-2 compatibility code to get much testing in Jupyter version 23. To make things worse, a Jupyter notebook can depend on third-party code that implements embedded widgets. This is one of the reasons why I don’t use Jupyter for my research, although I am a big fan of using it for teaching. The other reason is that I cannot usefully link a notebook to other relevant information, such as code and data dependencies. Jupyter doesn’t provide any functionality for this, and such links are hard to implement externally precisely because of wordization.
Wordization is often attributed to evil intentions of market dominance, of the kind regularly assumed for a company like Microsoft. But I believe that the fundamental cause is the obsession with tools over content that has driven the computing industry for many years. The tool aspects of a piece of software, such as its feature list and its user interface, are immediately visible. Its data model, by contrast, attracts attention only from a few specialists, if at all. Users feel the consequences of bad (or absent) data model design through the symptoms of wordization, in particular lock-in, but rarely understand where it comes from. Interestingly, this problem was also mentioned yesterday at the OpenDreamKit meeting, by Michael Kohlhase, who discussed the digital representation of mathematical knowledge and the difficulty of exchanging it between different software tools. I have written earlier about another aspect, the representation of scientific models in computational science, which illustrates the extreme case of tools having absorbed scientific content to the point that its users don’t even realize that something is missing.
Back to notebooks. Let’s forget about tools for the moment and consider the question of what a notebook actually is, as a digital document. I think that notebooks are trying to be two different things, and that many of the problems we have with them come from this ambiguity. One role of notebooks is the documentation of computational work as a narrative with direct access to the data. This is why people publish notebooks. The other role is as a protocol of interactive explorative work, i.e. the computational scientist’s equivalent of a lab notebook. The two roles are not completely unrelated, but they are still significantly different.
To see the difference, look at how experimental scientists worked in the good old days of pencil, paper, and the printing press. As experiments were done, all the relevant information (preparation, results, …) was written down, immediately, with a time stamp, in the lab notebook. Like a bank ledger, a lab notebook is an immutable protocol of what happened. You don’t go back and change earlier entries; that would even be considered fraud. You just add information at the end. Of course, the resulting protocol is not a good way to communicate one’s findings. Therefore its contents are distilled and written up in a separate narrative, which surrounds a description of the work and its most important results with a motivating introduction and summarizing conclusions. This is the classic scientific article.
Today’s computational notebooks are trying to be both protocol and narrative, and pretend that there is a seamless transition between them. One unfortunate consequence is that computational protocols disappear as they are edited to become narratives. This could be alleviated by keeping notebooks under version control, but I have yet to see good versioning support in any notebook-type tool. More fundamentally, today’s notebook tools don’t encourage keeping a protocol. They encourage frequent changes to the code and the results, keeping only the latest version. As editors for narratives, notebook tools are also far from ideal, because they encourage interactive execution of small code snippets, making it easy to lose track of what was actually executed and in what order. In Jupyter, the only way to ensure a coherent narrative is to (1) restart the kernel and (2) re-execute all cells. There isn’t even a single menu entry for this combined operation. Actually, I wonder how many Jupyter users are aware that they must restart the kernel before re-executing all the cells if they want to ensure reproducibility.
With all that said, here is my current idea of what a notebook should look like at the bit level. A notebook data model should have two distinct entries, one for a protocol and one for a narrative. The protocol entry is a sequence of code cells and results, as they were executed since the start of the computation (for Jupyter, that means the last kernel restart). The narrative is a user-edited sequence of code cells, documentation cells, and results. The actual cell contents could well be shared between the two views: store each cell with a unique ID, and make the protocol and the narrative simple lists of IDs. The representation of code and documentation cells in such a data model is straightforward, though there’s a huge potential for bikeshedding in defining the details. The representation of results is much more difficult if you want to support more than plain text output. In the long run, it will be inevitable to define clear data models for every type of display widget, which is a lot of work.
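To make this more concrete, here is a minimal sketch of such a two-view data model, written as a Python dictionary with JSON-compatible values. All field names are hypothetical; the only point is that each cell is stored exactly once, under a unique ID, while the protocol and the narrative are nothing more than lists of references into that store.

    # Minimal sketch of the proposed data model (hypothetical field names).
    # Cells are stored once, keyed by a unique ID; the protocol and the
    # narrative are plain lists of references into that store.
    notebook = {
        "cells": {
            "c1": {"type": "code", "source": "x = 2 + 2"},
            "c2": {"type": "code", "source": "print(x)"},
            "d1": {"type": "doc",  "source": "We compute a small sum."},
            "r1": {"type": "result", "of": "c2", "data": {"text/plain": "4"}},
        },
        # Protocol: everything executed since the last kernel restart, in order.
        "protocol": ["c1", "c2", "r1"],
        # Narrative: the user-edited story, free to omit or reorder cells.
        "narrative": ["d1", "c1", "c2", "r1"],
    }

The narrative can then be edited freely without ever touching the protocol, and both views stay consistent because they point to the same cell contents.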
From the tool point of view, the current Jupyter interface could be complemented by a non-editable protocol view. I’d also like to see a single command (menu/keyboard) for the “clean slate” operation: save the current state as a snapshot (or commit it directly to version control), restart the kernel, and re-initialize the protocol to an empty list. But what really matters to me is the data model. Unlike the current one implemented in Jupyter, the one outlined above could be integrated into workflow management and archiving tools, such as my own ActivePapers. We’d probably see an Emacs mode for working with it as well. Plus pretty-printing tools, analysis tools, etc. We’d see an ecosystem of tools working with notebooks. A Dream of Openness.
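As a sketch, and assuming the hypothetical data model above plus a tool-provided way to restart the kernel, the “clean slate” command could be as simple as:

    import subprocess

    def clean_slate(notebook, restart_kernel, message="clean-slate snapshot"):
        """Snapshot the current state, restart the kernel, reset the protocol."""
        # 1. Snapshot: here simply a commit of the working tree to version control.
        subprocess.run(["git", "commit", "-am", message], check=False)
        # 2. Restart the kernel (a callable supplied by the notebook tool).
        restart_kernel()
        # 3. Re-initialize the protocol; the narrative is left untouched.
        notebook["protocol"] = []

Everything here is illustrative rather than an actual Jupyter API, but it shows how little machinery the operation requires once protocol and narrative are separate entries in the data model.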
September 8, 2015 at 06:13
Lots of great thoughts here.
I agree that the kernel’s record of execution (what you call the protocol) and the user’s narrative are two things that both deserve their own data model and persistence. It would not be difficult to write a kernel monitor that saves the kernel’s record of execution using the same JSON input/output format that is in the notebook. I don’t think it would be too difficult to store both views in a notebook’s data model, and I think it would be well worth exploring those ideas.
Also, some of the frontend work we are doing will open the door for user interfaces that are hybrids of narrative-focused and order-of-execution models. The simplest order-of-execution data model is just a text file of code :)
Also agree that the “restart and run all” action should be exposed to users through keyboard/menu/toolbar. That is easy to do and we should just do it.
But there are some fundamental abstractions that would make it challenging to combine the kernel record with the narrative document into one overall document model in the general case:
1. In a multiuser context, you can end up with one co-edited narrative that is connected to separate kernels for each user. We are still working through the basic questions related to the usability of such a system and how the kernel records of multiple users get combined into the single shared narrative. The multiuser stuff will force us to start thinking more about this, though.
2. There is no promise that kernels will be running on the same system where the notebook narratives are stored. A user might be storing the narrative document in a SQL DB, but using a kernel on a large-RAM instance on AWS, with completely separate authentication contexts. This limitation is more practical than abstraction-related, but it still has to be dealt with.
These challenges don’t prevent us from exploring things pretty easily though.
However… I think your arguments about the Jupyter notebook format itself are a bit misleading and contradictory. You claim that the Jupyter notebook document format:
* “creates lock in”
* “restricts the users’ freedom to work with their data”
* “upgrade paths for notebooks between versions, but at some time this is bound to create trouble. Bugs are a fact of life”
* “no other tool that can be used productively for working with notebooks”
This is misleading because the Jupyter notebook is already a completely open JSON document format. We even have a formal JSON spec for it:
https://github.com/jupyter/nbformat/blob/master/nbformat/v4/nbformat.v4.schema.json
And there is already an ecosystem of tools for working with notebooks and notebook content outside the official Jupyter web-application. The existing Jupyter Notebook format is as open and non-locked-in a format as possible. We even have a test suite :)
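For instance, a notebook can be read and validated against that schema without the web application at all, using the nbformat Python package (a minimal sketch; the file name is a placeholder):

    # Read and validate a notebook outside the Jupyter web application,
    # using the nbformat package (the file name is a placeholder).
    import nbformat

    nb = nbformat.read("example.ipynb", as_version=4)
    nbformat.validate(nb)               # raises if the document violates the JSON schema
    print(len(nb.cells), "cells")       # the notebook is just a plain data structure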
You then propose a new data model/standard that includes both the narrative and protocol data and claim that new data model would lead to “an Emacs mode for working with it as well. Plus pretty-printing tools, analysis tools, etc. We’d see an ecosystem of tools working with notebooks. A Dream of Openness.”
While I agree that we should explore differentiating and clarifying the order-of-execution and narrative components in the data model, I fail to see how such a data model would be more open, less buggy, involve less lock-in, or have a broader ecosystem of tools than the existing notebook format.
September 8, 2015 at 08:49
Thanks for those observations and comments!
Let me address the question of openness first. I see openness as having three levels:
Jupyter, like today’s Word, is at level 2. Everything is documented, but the data model is so closely tied to Jupyter’s functionality and design that it is of little use for any other software. Moreover, the format changes frequently (another sign that it is tied to Jupyter), which discourages people from adapting tools that work on, say, Python scripts (pylint etc.) to handle notebooks as well.
I can think of at least two software ecosystems that a truly open notebook format should be designed for:
In my ideal world, notebook execution would be supervised by a workflow manager that ensures provenance tracking and dependency management with non-notebook data (other software, databases, local files, …). The final notebook could become part of a publication that also contains content created by different tools (theory chapters, …).
Now for the technical points. Your observations about multiuser and distributed scenarios are indeed important. Here’s a modified proposal, which should also please Greg Wilson, who always (and rightly) insists on the importance of diffing and merging.
First layer: a sequence of code cells.
Second layer: an execution log, i.e. a list of (code, result) pairs with “code” being a pointer into the first layer, plus some information about the runtime engine (language, version, …).
Third layer: a narrative consisting of documentation interspersed with pointers to code from layer 1 and results from layer 2. Note that this narrative does not have to be constrained to a sequence of cells as in the current notebook format, though such a constraint may be useful for collaborative editing.
The three layers can be stored in a single document, but also separately, using some suitable cross-reference technique. A single layer-3 narrative can reference multiple data items at layers 1 and 2, allowing the data items to be stored in a distributed fashion. In the end, a single combined document can be made for archiving and publishing.
In a multiuser setting, each user has a layer-1/2 document that others can see but not modify. The shared layer-3 document can reference everyone’s layer-1/2 data.
Diffing and merging can be done at layers 1 and 3. Layer 2 is a mathematical function of the layer-1 data, so it can be diffed but not edited or merged.
An added advantage of separating these layers for Jupyter is that layers 1 and 2 can be managed completely by the kernel. This makes it possible to include, in the layer-1/2 data, code that was fed to the kernel from some source other than the Web notebook, improving reproducibility.
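As a rough sketch, and purely to illustrate the cross-referencing (all field names hypothetical), the three layers could be stored as three separate documents that point to each other by ID:

    # Rough sketch of the three layers as separate, cross-referenced documents
    # (all field names hypothetical).
    layer1 = {                 # code cells, stored once under unique IDs
        "c1": "x = 2 + 2",
        "c2": "print(x)",
    }
    layer2 = {                 # execution log: (code, result) pairs plus engine info
        "engine": {"language": "python", "version": "3.4"},
        "log": [
            {"code": "c1", "result": {}},
            {"code": "c2", "result": {"text/plain": "4"}},
        ],
    }
    layer3 = [                 # narrative: documentation plus cross-references
        "We compute a small sum and display it.",
        {"code": "c1"},        # pointer into layer 1
        {"result": 1},         # pointer to entry 1 of the layer-2 log
    ]

Because layers 1 and 2 are referenced only by ID and index, each user’s layer-1/2 documents can live wherever their kernel runs, while the shared layer-3 narrative merely points to them.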
September 9, 2015 at 03:12
The ideas around these layers are interesting and worth pursuing. I still don’t agree about the lack of an ecosystem of other tools for working with notebooks. GitHub renders notebooks in place, O’Reilly Media has integrated notebooks into their publishing platform, and there are other open-source tools for working with notebooks in different ways. This stuff is new and immature, but that ecosystem is growing faster than we can keep track of.
But I completely agree with you about the need to explore these different layers. I think the idea of a “code cell” that includes both input (source code) and output (a sequence of mime bundles) is a fundamental abstraction that applies across those layers well.
The first two layers you outline above are really just a sequence of such code cells, in the order they were run in the kernel’s current session.
The narrative layer is then a subset of those code cells, intermixed with narrative cells.
The notebook’s model for code cells is super simple and could be applied easily across these different views. We are even extracting our JavaScript code for working with code cells as standalone npm packages, so other folks can easily build other UIs for working with code cells.
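For reference, this is roughly what one code cell looks like in the current nbformat v4 document (shown here as an equivalent Python dict rather than raw JSON): the input source plus a list of outputs, where each result output carries a MIME bundle under “data”.

    # One code cell as stored in an nbformat v4 notebook, shown as the
    # equivalent Python dict: input source plus a list of outputs, where
    # each result output carries a MIME bundle under "data".
    code_cell = {
        "cell_type": "code",
        "execution_count": 1,
        "metadata": {},
        "source": "2 + 2",
        "outputs": [
            {
                "output_type": "execute_result",
                "execution_count": 1,
                "metadata": {},
                "data": {"text/plain": "4"},   # MIME type -> representation
            }
        ],
    }

Exactly the same structure could serve as the unit shared between the order-of-execution and narrative views.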
And it wouldn’t be too difficult to refer to cells by reference across these different views. Also:
* Cells would need ids.
* To get things moving, we could build a standalone tool for recording the input/output of a kernel that also refers to those ids. That could be done as a separate process that listens on a kernel’s message channels (a rough sketch follows this list). Min could probably do this in 10 minutes ;-)
* Then we can think about how to maintain cell references in the two contexts.
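A bare-bones version of such a recorder, using the jupyter_client package to start a kernel and listen on its IOPub channel, might look like the sketch below; the cell-ID bookkeeping is left out, and the details are illustrative rather than a finished tool.

    # Bare-bones recorder of a kernel's execution record, using jupyter_client.
    # It starts its own kernel here for simplicity; a real tool would attach
    # to an existing kernel's connection file instead.
    from queue import Empty
    from jupyter_client import KernelManager

    km = KernelManager(kernel_name="python3")
    km.start_kernel()
    kc = km.client()
    kc.start_channels()

    kc.execute("x = 2 + 2")
    kc.execute("print(x)")

    record = []   # the kernel's record of execution, in order
    while True:
        try:
            msg = kc.get_iopub_msg(timeout=5)
        except Empty:                      # no more messages: kernel is idle
            break
        if msg["msg_type"] in ("execute_input", "execute_result", "stream"):
            record.append({"msg_type": msg["msg_type"],
                           "parent": msg["parent_header"].get("msg_id"),
                           "content": msg["content"]})

    kc.stop_channels()
    km.shutdown_kernel()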
There are lots of other important things to address though, such as what happens to the kernel record when it is restarted. Technically, the kernel record should be nullified – but maybe that is where provenance tracking ideas come into play – there could be multiple such records. We would also have to figure out a way to update references in the narrative when code is rerun and the kernel record changes.
I think the next step would be to open an issue on the jupyter enhancement proposal repo to see if a formal enhancement proposal is appropriate.
https://github.com/jupyter/enhancement-proposals
September 9, 2015 at 08:26
There is indeed the beginning of an ecosystem around notebooks, but all those tools are just subsets of the Jupyter functionality repackaged for particular contexts. My idea of ecosystem integration is integration with tools that handle substantially different tasks.
What I particularly regret is that notebooks are in theory a useful tool for advancing reproducibility, but today’s implementations (not just Jupyter, but also Sage and Mathematica) are also major obstacles on the way to that goal. For the background, I refer to my recent paper, in particular the sections “Replication, reproduction, and reuse” and “Evaluation of existing technology”. In short, notebooks are good for reproduction (because they help to explain a computational procedure while at the same time being complete and precise because they contain the code), but a catastrophe for replication (because there is no dependency management and no reuse).
Concerning the technical points, I agree with 90% of what you say, and in particular I agree that this discussion would be better continued on GitHub, so I will move on there.
September 9, 2015 at 11:15
I started writing an enhancement proposal. Contributions and comments are very welcome!
September 9, 2015 at 16:02
I agree that notebooks as currently implemented don’t address replicability in any way. At the same time, some of the most promising efforts in that direction layer replicability on top of notebooks. Have you seen binder?
http://mybinder.org/
See you on GitHub!
September 9, 2015 at 17:08
Yes, I have seen binder, though I never got it to work. But binder can’t fix the replicability issues with notebooks, because relevant information is missing from the notebooks themselves. Binder preserves the computational environment, as do other tools. But replication of the computation in a notebook also requires a complete log of all code execution since the last kernel restart. That’s the problem I am addressing in my proposal for a new notebook format.
March 15, 2016 at 23:14
Great post with a lot of interesting thoughts behind it. I gave a talk at OSCON 2015 that mirrors a number of your thoughts in some ways: https://www.youtube.com/watch?v=JI1HWUAyJHE although I approach the Jupyter Notebook “2 data models” as a given, and try to offer advice about how to operate in that space.
March 17, 2016 at 09:27
Can you summarize your ideas in a few sentences? I am not very motivated to watch a 35-minute video that mostly talks about stuff I know just to pick out a few new points.
March 17, 2016 at 16:49
Sure, I wrote a short blog post about the high level ideas here: http://www.svds.com/jupyter-notebook-best-practices-for-data-science/
March 17, 2016 at 17:55
Thanks! I’d call your approach a workaround rather than a solution, but it’s definitely an interesting one!
July 28, 2016 at 22:02
The original post and comments were a very interesting read. It looks like the conversation has been silent for a while, but I’ll go ahead and provide my thoughts. I’ve been a Mathematica user for years and have started in Jupyter only recently.
My goal when writing a notebook has evolved, but it ultimately comes down to this: develop the set of documentation, code, and displays that explain the process, do the computation, and show the results, all in chronological execution order. Understanding how important the restart-execute-notebook step is to developing this kind of notebook is essential to working in a notebook environment. I do all kinds of hacky things in between each restart cycle to get my code doing what I want and to verify it’s doing what I want, but the real test comes when I restart and make sure each operation worked as expected from the ‘prototype’ code. With that model I think I’m, per your definition, iteratively arriving at a replicable state. My code is intended to cover the whole analysis process from start (data import) to finish, with no dependencies on a preexisting kernel. In that situation the results can always be replicated. Admittedly the scale of my work is probably small compared to many modern scientific endeavors. Is it just not practical to expect that large-scale computations be run from “start” to “finish” during the development cycles? If yes, why use a REPL in the first place? Breaking up the process into appropriately sized sub-steps, each with its own replicably developed notebook, could support that environment, no?
You could certainly argue that the available tools are inadequate because successful replicability depends on how you use the tool, but that’s kind of true of any tool.
On another note I’ve often used a parallel notebook running in the same or sometimes other kernels. In the same kernel I use the second as a scratch space for operating on the data in the primary notebook. This helps keep the original clean, and I only ‘import’ the code from the scratch space when I’ve got the functions cleaned up. When using a separate kernel, the process is pretty similar, but I have to manually create or port over any data I need to the scratch area. I could see use in having a sub-kernel scoped such that it had access to materials in the super-kernel but not the other way around. I think I could implement something like that in Mathematica fairly easily, but it’s not a default option. I’m not sure about Jupyter.
Hopefully that wasn’t all based on a complete misinterpretation of your conversation.
August 1, 2016 at 09:59
Thanks for your feedback, and in particular the detailed description of how you work with notebooks!
I think your approach illustrates what I described as the transition from a protocol to a narrative. It shows that this is quite possible, but requires strict discipline. I don’t know about Mathematica, but Jupyter doesn’t do much to encourage this kind of approach. There have been improvements in recent versions; in particular, “restart kernel and re-run all the code” is now available as a single menu entry.
You mention larger computations, which are indeed a challenge with notebooks. When I tried Jupyter for a real research project, I ended up with notebooks that took about an hour to fully recompute, which makes it very tempting not to do it too often. I worked with inconsistent notebooks most of the time and did a recompute once a day during lunch break, but I wasn’t particularly happy with this. Your idea of turning sub-steps into individual notebooks sounds nice in theory, but it requires some form of dependency handling between notebooks. Again, I don’t know about Mathematica, but Jupyter does not support this.