Making Computations and Publications Reproducible with VisTrails


Reproducible Research for Scientific Computing

The VisTrails system supports the creation of reproducible experiments. VisTrails integrates data acquisition, derivation, analysis, and visualization as executable components throughout the scientific exploration process, and through systematic provenance capture, it makes it easier to generate and share reproducible results. Using VisTrails, authors can link results to their provenance, reviewers can assess the experiment’s validity, and readers can repeat and utilize the computations.

Juliana Freire and Claudio T. Silva, Polytechnic Institute of New York University

This article has been peer-reviewed.

Computing in Science & Engineering, July/August 2012. 1521-9615/12/$31.00 © 2012 IEEE. Copublished by the IEEE CS and the AIP.

Important scientific results give insight and lead to practical progress. The ability to test these results is crucial for science to be self-correcting, and the ability to reuse and extend the results enables science to move forward. In natural science, long tradition requires that results be reproducible, and in math, results must be accompanied by formal, verifiable proofs. However, the same standard hasn’t been applied to the results of computational experiments. Most computational experiments are specified only informally in papers, where experimental results are briefly described in figure captions, and the code that produced the results is seldom available.

The lack of reproducibility for computational results currently reported in the literature has raised questions about their reliability [1] and led to a widespread discussion on the importance of computational reproducibility. Academic institutions such as the Swiss Federal Institute of Technology, Zurich (ETH, Zurich; www.vpf.ethz.ch/services/researchethics/Broschure.pdf), funding agencies, conferences (www.sigmod2011.org/calls_papers_sigmod_research_repeatability.shtml), and journals (www.signalprocessingsociety.org/publications/periodicals/tsp) have started to encourage (or require) authors to include reproducible results in their publications. However, a major barrier to the wider adoption of reproducibility is the fact that it’s hard for authors to derive a compendium that encapsulates all the components (for example, the data, code, parameter settings, and environment) needed to reproduce a result; and even when a compendium is available, it’s often hard for reviewers to verify results.

As a step toward simplifying the creation and review of reproducible results, and motivated by the needs of computational scientists, we built an infrastructure that supports the life cycle of computational experiments. A key component of this infrastructure is a provenance management system that systematically and transparently captures the metadata necessary to reproduce experiments, including the specifications of the computations, input and output data, source code, and library versions. We also developed a set of solutions to address practical aspects related to reproducibility, including methods to link results to their provenance, explore parameter spaces, wrap command-line tools, interact with results through a Web-based interface, and upgrade the specification of computational experiments to work in different environments and with newer versions of software.

This infrastructure has been implemented and released as part of VisTrails (www.vistrails.org), an open source workflow-based data exploration and visualization tool [2], and it’s already being used by different groups of scientists. Videos that illustrate the process to create reproducible publications using VisTrails are available at www.vistrails.org/index.php/RepeatabilityCentral. In this article, we give an overview of this infrastructure and its components, how it can be used, and its benefits and limitations.

Figure 1. Anatomy of a real reproducible paper that investigates Galois conjugates of quantum double models [3]. Figures in the paper are accompanied by their provenance, and users and reviewers can execute and examine the interactive results on the Web.

Creating Reproducible Papers

Before discussing how to create a reproducible paper, let’s first examine a real one. Figure 1 illustrates the anatomy of a reproducible paper created using our infrastructure. This paper investigates Galois conjugates of quantum double models [3]. Figures in the paper are accompanied by their provenance, consisting of the workflow used to derive the plot, the underlying libraries invoked by the workflow, and links to the input data (simulation results stored in an archival site). This provenance information allows all of the paper’s results to be reproduced. In the paper’s PDF version (available at http://arxiv.org/abs/1106.3267), the figures are active and, when clicked, their corresponding workflow is loaded into VisTrails and executed on the reader’s machine. The reader can then modify the workflow and change parameter values and input data. The same provenance also enables the result to be published on a website, where users and reviewers can execute it and examine the results using a Web browser [4].

VisTrails and Provenance for Computational Experiments

Provenance is a critical ingredient for reproducible experiments [5,6]. If we know how a figure or table was generated (the computational processes and data used), we can incorporate them in the paper so that the result can be reproduced. However, because computational experiments can be complex and their design involves many trial-and-error steps, it’s easy to get lost. For example, it’s easy to forget the exact parameter values or the version of an input file that was used to derive a specific result. Therefore, systematic mechanisms are needed to capture the provenance of these experiments. In our infrastructure, we’ve adopted the VisTrails system as a means to capture provenance.
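To make the idea of systematic capture concrete, here is a minimal sketch in plain Python. It is not VisTrails code: the function names and record fields are hypothetical and only illustrate the kind of metadata the text mentions (parameter values, the exact version of the input, and the environment).

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def sha256_of_bytes(data: bytes) -> str:
    """Return a hex digest that identifies data by its content."""
    return hashlib.sha256(data).hexdigest()


def make_provenance_record(step_name, parameters, input_data, output_data):
    """Build a JSON-serializable record for one computational step.

    The field names are assumptions made for this illustration.
    """
    return {
        "step": step_name,
        "parameters": dict(parameters),
        "input_sha256": sha256_of_bytes(input_data),
        "output_sha256": sha256_of_bytes(output_data),
        "python_version": platform.python_version(),
        "platform": sys.platform,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def scale(values, factor):
    """Toy analysis step standing in for a real computation."""
    return [v * factor for v in values]


raw = b"1,2,3"  # stand-in for the contents of an input file
values = [int(x) for x in raw.decode().split(",")]
result = scale(values, factor=2.5)

record = make_provenance_record(
    "scale", {"factor": 2.5}, raw, json.dumps(result).encode()
)
print(json.dumps(record, indent=2))
```

Because the record stores a content hash rather than a file name, it still identifies the exact input version after the file has been renamed or overwritten.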


VisTrails and Related Systems

VisTrails (see www.vistrails.org/usersguide) is an open source system designed to support exploratory computational experiments. VisTrails is written in Python and uses Qt as its GUI toolkit (through PyQt Python bindings). It is multiplatform and runs on Windows, Mac, and Linux. Since its beta release in 2007, the system has been downloaded more than 35,000 times. The VisTrails wiki has had more than 1.2 million page views, and Google Analytics reports that visitors to the site come from 75 different countries.

VisTrails includes and substantially extends useful features of scientific workflow and visualization systems. Similar to scientific workflow systems such as Kepler (https://kepler-project.org) and Taverna (www.taverna.org.uk), VisTrails allows the specification of computational processes that integrate existing applications, loosely coupled resources, and libraries according to a set of rules. As with visualization systems such as Advanced Visual Systems (www.avs.com) and ParaView (www.paraview.org), VisTrails makes advanced scientific and information visualization techniques available to users, letting them explore and compare different visual representations of their data. As a result, users can create complex workflows that encompass important steps of scientific discovery, from data gathering and manipulation to complex analyses and visualizations, all integrated in one system.

There are two key aspects that distinguish VisTrails from these systems. First, it provides comprehensive provenance support: in addition to capturing data provenance (for example, the steps followed to create a given data product), VisTrails also captures provenance of the exploration process, including the trial-and-error refinements applied to workflows. Second, VisTrails has been a pioneer in the support of reproducible publications. It has introduced functionality for sharing and publishing computational experiments, including the ability to link results reported in a document to their provenance, run workflows in multiple environments, manage files manipulated by workflows, and automatically upgrade workflows when the underlying libraries change.

(For a basic overview of VisTrails and a discussion about other possible tools, see the sidebar “VisTrails and Related Systems.”) Compared to both scientific workflow and visualization systems, a distinguishing feature of VisTrails is its provenance infrastructure: VisTrails was designed from the start to both capture and leverage provenance information. VisTrails captures a detailed history of the steps followed and data derived in the course of an exploratory task. Workflow systems have traditionally been used to automate repetitive tasks, but in applications that are exploratory in nature, such as simulations, data analysis, and visualization, not much is repeated; change is the norm. As users generate and evaluate hypotheses about their data, they create a series of different but related workflows, adjusting them iteratively. VisTrails was designed to manage these rapidly evolving workflows and maintains provenance of

• data products (such as visualizations and plots),
• the workflows that derive these products,
• the workflow execution log,
• information about the underlying tools and libraries invoked by the workflows, and
• user-defined annotations that enrich the automatically captured provenance.

VisTrails addresses important usability issues that have hampered a wider adoption of workflow and visualization systems. To cater to a broader set of users, including many who don’t have programming expertise, it leverages provenance information to provide a series of operations and user interfaces that simplify workflow design and use, including the ability to create and refine workflows by analogy, to query workflows by example, and to suggest workflow completions as users interactively construct their workflows using a recommendation system [5]. We’ve also developed a framework that lets users create custom applications (mashups) that can be more easily deployed to end users [7].

Sharing and Publishing Results

Although capturing provenance is a necessary step for creating reproducible results, other issues should be considered when sharing these results: notably, how they should be packaged so that they can be understood by people other than their authors as well as executed in environments different from the one where they were created.

Specifying computations as workflows. In our infrastructure, computational processes are specified as pipelines (or workflows). Workflow systems support the creation of pipelines that combine multiple tools. As such, they enable the automation of repetitive tasks and result reproducibility. Workflows are rapidly replacing primitive shell scripts in a wide range of tasks, as evidenced by several workflow-based applications, both commercial (Apple’s Mac OS X Automator and Yahoo! Pipes) and academic (NiPype, Kepler, and Taverna).

Workflows have several advantages compared to scripts and programs written in high-level languages. They provide a simple programming model whereby a sequence of tasks is composed by connecting the outputs of one task to the inputs of another. Figure 1 shows an example workflow that reads a zip archive from the Web that contains a set of result files, uncompresses it, performs an analysis, and uses matplotlib to output a plot. This simpler programming model lets workflow systems provide intuitive visual programming interfaces, which makes them more appealing for users who don’t have substantial programming expertise. It also makes it possible to create abstractions to represent a given workflow at different levels of granularity, thereby simplifying its presentation and making it easier for users to understand its functionality [7].

Another benefit of using workflows is that they have an explicit structure: they consist of graphs, where nodes represent processes (or modules) along with their parameters, and edges capture the flow of data between the processes. We can exploit this structure to allow rich queries (and operations) over workflow collections, enabling knowledge reuse [8].

To specify computations as a workflow, we must integrate with the workflow system the user-defined functions and libraries (such as simulation codes and specialized visualizations) required by the computations. VisTrails provides a package API for this purpose. A VisTrails package is simply a collection of Python classes, stored in one or more files, that respect specific conventions and that invoke the user-defined functions. For example, the workflow in Figure 1 makes use of a package that wraps the ALPS library (http://alps.comp-phys.org), which provides high-end simulation codes for strongly correlated quantum mechanical systems.
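The graph structure described above can be sketched in a few lines of plain Python. This is an illustration only and does not use the VisTrails package API: the Workflow class, its methods, and the toy modules are hypothetical. Nodes hold a function and its parameters, edges carry data from one module to the next, and execution follows a topological order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Workflow:
    """A tiny dataflow graph: modules are nodes, connections are edges."""

    def __init__(self):
        self.modules = {}      # name -> (function, parameters)
        self.connections = {}  # name -> list of upstream module names

    def add_module(self, name, func, **params):
        self.modules[name] = (func, params)
        self.connections.setdefault(name, [])

    def connect(self, source, target):
        """Feed the output of `source` into `target` as a positional input."""
        self.connections.setdefault(target, []).append(source)

    def execute(self):
        """Run every module after all of its upstream modules have run."""
        results = {}
        for name in TopologicalSorter(self.connections).static_order():
            func, params = self.modules[name]
            inputs = [results[up] for up in self.connections[name]]
            results[name] = func(*inputs, **params)
        return results


def modules_using(workflow, param_name):
    """Structural query: which modules set a given parameter?"""
    return sorted(
        name for name, (_, params) in workflow.modules.items()
        if param_name in params
    )


# A toy pipeline loosely mirroring the read, analyze, report pattern.
wf = Workflow()
wf.add_module("load", lambda text: text, text="1 2 3")
wf.add_module("parse", lambda raw: [float(x) for x in raw.split()])
wf.add_module("analyze", lambda xs, power: [x ** power for x in xs], power=2)
wf.add_module("report", lambda ys: "max=%.2f" % max(ys))
wf.connect("load", "parse")
wf.connect("parse", "analyze")
wf.connect("analyze", "report")

results = wf.execute()
print(results["report"])
print(modules_using(wf, "power"))
```

Because the pipeline is data rather than opaque script text, a query such as modules_using can inspect it without running it, which is the property that makes queries over workflow collections possible.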
The system also provides a mechanism, CLTools, that simplifies the process of wrapping command-line tools. For more information about these, see the VisTrails User’s Guide (www.vistrails.org/usersguide).

Even when the provenance associated with a result is represented as a workflow, shipping the workflow to be run in an environment different from the one in which it has been designed still raises many challenges. From hard-coded locations for input data to dependencies on specific library and hardware versions, adapting workflows to run in a new environment can be challenging and sometimes impossible. To address this issue, we developed mechanisms (which we detail later) that provide reliable links between data and their provenance [9] and extended VisTrails to support the evolution of software libraries used in the workflows [10].

Connecting data to provenance. The common practice of connecting workflows and data products through file names has important limitations. Consider, for example, a workflow that runs a simulation and outputs a file with a visualization of the simulation results. If the workflow outputs an image file to the file system, any future run will overwrite that image file. Also, if we use different parameters or improve the simulation code and run the updated workflow, the original image is lost. If a version-control system managed that image file, the user could retrieve the old version from the repository. However, if users revert the output image to the original version, how would they know how it was created? Because there’s no explicit link between the workflow instance (the workflow specification, parameters, and input files) and the different versions of its output, determining provenance is challenging. This problem is compounded when computations take place in multiple systems, and recording the complete provenance requires tying together multiple workflows through their outputs and inputs. As files are overwritten, renamed, or moved, provenance information could be lost or become invalid.

VisTrails implements a persistence framework that, by coupling workflow provenance with the versioning of data produced and consumed by workflows, captures the actual changes to data as well as detailed information about how those changes came about [9]. The persistence package contains modules that identify data by its contents (via hashing), its use (via the workflow that generated it), and its history (via a version-control system).

Each version of a result is connected, through a strong link, to the provenance that details how the result was generated. This ensures that an author can always retrieve data used in previous work, even if the original file has been changed or removed. Instead of relying on users or ad hoc approaches to automatically derive file names, strong links are identifiers derived from the file content, the workflow specification, and any parameters used. As a result, they accurately and reliably tie a given workflow instance to its input and derived data.
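The following sketch shows one way an identifier could be derived from the file content, the workflow specification, and the parameters, as described above. It is an assumption-laden illustration, not the VisTrails persistence package: the strong_link function, the JSON canonicalization, the choice of SHA-256, and the in-memory ResultStore are all hypothetical.

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """Identify data by its contents rather than by its file name."""
    return hashlib.sha256(data).hexdigest()


def strong_link(workflow_spec: dict, parameters: dict, input_data: bytes) -> str:
    """Derive one identifier from workflow, parameters, and input content."""
    canonical = json.dumps(
        {
            "workflow": workflow_spec,
            "parameters": parameters,
            "input": content_hash(input_data),
        },
        sort_keys=True,          # key order must not change the identifier
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResultStore:
    """In-memory stand-in for a managed, versioned data repository."""

    def __init__(self):
        self._results = {}

    def put(self, link, output, provenance):
        self._results[link] = (output, provenance)

    def get(self, link):
        return self._results.get(link)


spec = {"modules": ["simulate", "render"], "edges": [["simulate", "render"]]}
data = b"initial conditions v1"

link_a = strong_link(spec, {"steps": 100}, data)
link_b = strong_link(spec, {"steps": 200}, data)

store = ResultStore()
store.put(link_a, "image-for-100-steps", {"parameters": {"steps": 100}})
store.put(link_b, "image-for-200-steps", {"parameters": {"steps": 200}})

# Re-running with the same inputs yields the same link, so the stored
# result can be reused instead of recomputed.
cached = store.get(strong_link(spec, {"steps": 100}, data))
print(cached[0])
```

Changing a parameter produces a new identifier, so the second run is stored beside the first instead of overwriting it, and each stored output keeps a record of how it was produced.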


Besides simplifying the process of maintaining data provenance, this approach provides a general mechanism for the persistent caching of both intermediate and final results, which can be useful, for example, to package workflows that include long-running computations. Furthermore, the use of a managed data repository allows the creation of workflows that are location agnostic: unlike workflows that point to files in the file system, these workflows can be shared and run in multiple environments unchanged.

Software and computational environment. When shipping experiments to be run in a new environment, configuration can be challenging. Although VisTrails is cross-platform, the code, libraries, and other dependencies underlying a workflow often aren’t. A good practice, when possible, is to package the environment (or a subset of it) together with the experiments. This can be achieved through the use of virtual machines or packaging systems such as CDE that capture system-level dependencies (www.pgbovine.net/cde.html).

Another issue that must be addressed is software and hardware evolution. As systems evolve and hardware changes, simply archiving an executable won’t suffice. Even archiving code can be problematic as language specifications change. Archiving environments as virtual machines usually allows reproduction but presents issues with scale. Furthermore, while exact reproduction is important to verify results, utilizing published work to extend solutions is more efficient when modern tools and algorithms can be substituted. Such upgrades can accelerate progress by allowing readers to take advantage of the new hardware and infrastructure. To address this, VisTrails stores, as part of a workflow’s provenance, the exact version of each module used and provides an automatic workflow upgrade mechanism [10]. If a reader downloads an old experiment, VisTrails can automatically upgrade the computations to match the current environment. This is accomplished in part by allowing package developers to specify upgrade paths when specifications change.

Local, remote, and mixed execution. Because special hardware used in experiments isn’t always readily available and input data are sometimes too large, it’s important to provide a flexible mechanism that allows different kinds of execution. We’ve been exploring different strategies to support workflow execution on clusters using native code and via Kitware’s ParaView (www.paraview.org) for parallel execution. Controlling these executions from a local machine can be challenging, so we’ve developed workflow components to automate the monitoring process. Another concern is a result that uses a very large dataset or data that’s proprietary; although we might be able to run the workflow locally, we might need to remotely query or aggregate the data. We developed modules that work with relational databases to support such remote queries in local workflows. It’s also possible for authors to create modules that are specific to their data and provide remote access to their own computational infrastructure (hardware and software).

Publishing

To support provenance-rich results in papers and the ability to link back to the workflows that derived them, we developed code and plug-ins for LaTeX, wikis, Microsoft Word, and PowerPoint. This lets authors easily embed and reproduce results, and lets readers follow links to and explore the actual computations. Figure 2a shows the LaTeX publishing interface provided by VisTrails. A user can configure how the results should be included, for example, to include a PDF figure and ship the workflow (Include .vtl) with the paper. This interface, in turn, generates a stub that can be included in a LaTeX file. Here’s an example of how to define the inclusion of a provenance-rich figure in a paper:

    \begin{figure}
    \begin{center}
    \vistrail[filename=ladder_dyl_gap_theta-2.xml,version=5,pdf,buildalways,getvtl,embedworkflow,execute]{width=8cm}
    \caption{(color online) Ground-state degeneracy splitting of the
      non-Hermitian doubled Yang-Lee model when perturbed by a string
      tension ( = 0).}
    \label{fig:figure}
    \end{center}
    \end{figure}

When the document includes the VisTrails LaTeX package (\usepackage{vistrails}) and is compiled, VisTrails is invoked to execute the \vistrail macro and insert the result in the PDF. The user can also choose to serve the workflow from a database, in which case the \vistrail command specifies the host and database name:

    \vistrail[host=alps.ethz.ch,db=vistrails,vtid=10,version=169,pdf]{width=8cm}
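The stubs shown above follow a regular pattern: key=value options, then bare flags, then a graphics argument in braces. As a small illustration of that pattern, the sketch below assembles such a stub from Python values. The vistrail_stub helper is hypothetical and is not part of VisTrails or its publishing interface; the option and flag names are taken from the two examples in the text.

```python
def vistrail_stub(options, flags=(), graphics="width=8cm"):
    # Assemble a \vistrail[...]{...} string: key=value options first,
    # then bare flags, then the graphics argument in braces.
    parts = [f"{key}={value}" for key, value in options.items()]
    parts.extend(flags)
    return "\\vistrail[" + ",".join(parts) + "]{" + graphics + "}"


# Workflow shipped as a file alongside the paper.
file_stub = vistrail_stub(
    {"filename": "ladder_dyl_gap_theta-2.xml", "version": 5},
    flags=("pdf", "buildalways", "getvtl", "embedworkflow", "execute"),
)

# Workflow served from a database.
db_stub = vistrail_stub(
    {"host": "alps.ethz.ch", "db": "vistrails", "vtid": 10, "version": 169},
    flags=("pdf",),
)

print(file_stub)
print(db_stub)
```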


Figure 2. Embedding reproducible results in publications. VisTrails provides an interface that lets users include provenance-rich results in papers; these results can be executed (a) on the user’s client machine or (b) on a server (using the CrowdLabs site through a Web browser).

Additionally, it’s possible to have papers point to results that can be executed on a remote server. For this, we use CrowdLabs (www.crowdlabs.org), a social website where users can share their workflows with the associated provenance and results. A user can upload workflows to the CrowdLabs server, where they can be executed from a Web browser. As Figure 2b illustrates, CrowdLabs also generates stubs that can be included in documents, such as a LaTeX file, a wiki, or an HTML page, so that these documents point to the executable workflow on the server. This makes it easier for readers to reproduce the results, because they don’t need to install any special software on their machines.

VisTrails also includes a mashup editor that lets authors create Web mashups based on their workflows to make them more accessible to readers [7]. For example, authors can select a subset of the workflow’s parameters and suggest values for reviewers to explore. Similar to workflows, mashups can be made available through CrowdLabs, where readers and reviewers can interactively modify parameters and analyze the results. CrowdLabs uses Flash to support interactive visualizations, and we’re currently exploring newer technologies, such as HTML5 and the Web Graphics Library (WebGL), to provide better interaction with graphical results.

Reviewing: Reproducibility and Workability

A reproducible paper has the potential to improve the quality of reviews, because reviewers have the ability to explore and validate conclusions. As we discussed, by appropriately packaging the experiments, it’s possible for a reviewer to reproduce and test the existing computations. In the absence of an infrastructure such as the one we developed, the reviewing task can be frustrating and time-consuming. A reviewer also needs to test different parameter configurations and access and interact with the computations. VisTrails provides a parameter exploration interface for quickly selecting and setting ranges that’s coupled with an intuitive spreadsheet interface for visually comparing results [2]. This lets reviewers assess how general a solution is, how much tuning it requires, and in which cases it fails.

For journals and conferences that evaluate computational results, the availability of a common, unified infrastructure (such as the one we developed) guarantees the uniformity of representation across experiments so that reviewers need not relearn the experimental setup for each submission. This was an important motivation for suggesting the use of VisTrails in the guidelines for the ACM Sigmod Repeatability initiative (see www.sigmod2011.org/calls_papers_sigmod_research_repeatability.shtml).
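The parameter exploration a reviewer performs, as described earlier in this section, amounts to running the same computation over a grid of parameter values and laying the results side by side. The sketch below shows that idea in plain Python. It does not use the VisTrails parameter exploration or spreadsheet interfaces; the explore function and the toy damped_peak experiment are hypothetical.

```python
from itertools import product


def explore(func, **ranges):
    """Run func once for every combination of the given parameter values."""
    names = list(ranges)
    grid = {}
    for combo in product(*(ranges[name] for name in names)):
        grid[combo] = func(**dict(zip(names, combo)))
    return names, grid


def damped_peak(damping, frequency):
    """Toy experiment standing in for a real workflow execution."""
    return round(frequency / (1.0 + damping), 3)


dampings = [0.0, 0.5, 1.0]
frequencies = [1.0, 2.0]
names, grid = explore(damped_peak, damping=dampings, frequency=frequencies)

# Lay the results out like a small spreadsheet: one row per damping
# value, one column per frequency.
for d in dampings:
    print(d, [grid[(d, f)] for f in frequencies])
```

Keeping every cell keyed by its parameter combination makes it easy to see where results change sharply, which is the kind of comparison a reviewer needs when judging how much tuning a method requires.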

Looking Ahead: Challenges and Opportunities

We’ve described our efforts to simplify the creation, review, and reuse of reproducible experiments. Our infrastructure is in production; it has been used to create a number of papers and to support the ACM Sigmod Repeatability initiative. Our design has been guided by close collaborations with scientists who are authoring reproducible papers. We offer a more detailed description of real use cases for our reproducibility infrastructure elsewhere [11].

End-to-end and long-term reproducibility of a scientific result is hard to achieve due to factors such as the use of specialized hardware, proprietary data, and inevitable changes in hardware and software environments. Nonetheless, with the infrastructure we’ve built, it’s possible to accurately document the processes through provenance capture, as well as to attain reproducibility for a result’s important subcomponents, such as the analysis and visualization of data derived from simulations run on special hardware.

As it stands, the infrastructure consists of a set of core functionality required for reproducibility; it’s by no means comprehensive. Our long-term goal is to use it as the basis for a general system in which different components can be mixed and matched to cater to the range of requirements in different domains and for different types of scientific results.

Our infrastructure’s current version is based on and requires the use of the VisTrails system. We recognize, however, that such an approach might not always be desirable, in particular for research that requires the use of interactive tools that can’t be wrapped within a workflow system. We’re currently extending our infrastructure to support other systems that capture provenance. We’ve also developed a plug-in mechanism that leverages the VisTrails provenance subsystem to add provenance support to other tools [12], including, for example, ParaView, VisIt, and Autodesk’s Maya.

As collections of reproducible computational experiments (along with their source code, raw data, workflows, and provenance) become available in community-accessible repositories, new software can be built upon this verified base. In data-intensive areas of science, significant amounts of knowledge accumulated by practicing scientists as best practices aren’t necessarily formalized. By publishing the provenance of exploratory processes, such as that captured by VisTrails, it is possible to make this knowledge more precise, bring it to the light of science, and make it verifiable by others.

The availability of fully documented experiments enables scientific advances that combine previous tools as well as ideas. For example, members of the community can search for related experiments (for example, “find experiments similar to mine”) and better understand existing tools and how they’re used. Furthermore, such collections let the community evaluate a contribution’s impact not only through the citations to a paper, but also through the use of the proposed software and data components. Although some repositories, such as nanoHub, myExperiment, and CrowdLabs, already cater to different aspects of reproducibility, they’re still in their infancy, and questions remain about what their architecture should be and even how they will be used [8].

Acknowledgments

We thank the VisTrails team for making this work possible, especially David Koop, Emanuele Santos, and Huy Vo. We also thank many collaborators and users who have provided insightful feedback on the design and implementation of our reproducibility infrastructure, in particular Philippe Bonnet, Dennis Shasha, Joel Tohline, and Matthias Troyer. The Department of Energy (DOE) Office of Science, Biological and Environmental Research (BER) and the National Science Foundation partially supported this work under awards IIS-1139832, IIS-1142013, IIS-0905385, IIS-1153728, CNS-1153503, and AGS-0835821.

References

1. D. Donoho et al., “Reproducible Research in Computational Harmonic Analysis,” Computing in Science & Eng., vol. 11, no. 1, 2009, pp. 8–18.
2. J. Freire et al., “VisTrails,” The Architecture of Open Source Applications, ch. 23, Lulu, 2011; www.aosabook.org/en/vistrails.html.
3. M.H. Freedman et al., “Galois Conjugates of Topological Phases,” Physical Rev. B, vol. 85, no. 4, 2012; http://link.aps.org/doi/10.1103/PhysRevB.85.045414.
4. P. Mates et al., “Crowdlabs: Social Analysis and Visualization for the Sciences,” Proc. 23rd Int’l Conf. Scientific and Statistical Database Management (SSDBM), Springer-Verlag, 2011, pp. 555–564.
5. J. Freire et al., “Provenance for Computational Tasks: A Survey,” Computing in Science & Eng., vol. 10, no. 3, 2008, pp. 11–21.
6. C. Silva, J. Freire, and S.P. Callahan, “Provenance for Visualizations: Reproducibility and Beyond,” Computing in Science & Eng., vol. 9, no. 5, 2007, pp. 82–89.
7. E. Santos et al., “VisMashup: Streamlining the Creation of Custom Visualization Applications,” IEEE Trans. Visualization and Computer Graphics, vol. 15, no. 6, 2009, pp. 1539–1546.
8. J. Freire, P. Bonnet, and D. Shasha, “Exploring the Coming Repositories of Reproducible Experiments: Challenges and Opportunities,” Proc. Very Large Databases (PVLDB), vol. 4, no. 12, 2011, pp. 1494–1497.
9. D. Koop et al., “Bridging Workflow and Data Provenance Using Strong Links,” Proc. Scientific and Statistical Database Management Conf. (SSDBM), Springer-Verlag, 2010, pp. 397–415.
10. D. Koop et al., “The Provenance of Workflow Upgrades,” Provenance and Annotation of Data and Processes, LNCS 5272, Springer-Verlag, 2010, pp. 2–16.
11. D. Koop et al., “A Provenance-Based Infrastructure to Support the Life Cycle of Executable Papers,” Procedia Computer Science, vol. 4, 2011, pp. 648–657; http://dx.doi.org/10.1016/j.procs.2011.04.068.
12. S.P. Callahan et al., “Towards Provenance-Enabling ParaView,” Provenance and Annotation of Data and Processes, LNCS 5272, Springer-Verlag, 2008, pp. 120–127.

Juliana Freire is a professor of computer science and engineering at the Polytechnic Institute of New York University. Her research interests include data management, large-scale information integration, Web mining, visualization, and provenance. Freire has a PhD in computer science from the State University of New York at Stony Brook. She’s a member of the ACM and IEEE. Contact her at [email protected].

Claudio T. Silva is a professor of computer science and engineering at the Polytechnic Institute of New York University. His research interests include visualization, geometry processing, graphics, and high-performance computing. Silva has a PhD in computer science from the State University of New York at Stony Brook. He’s a member of IEEE, the ACM, Eurographics, and Sociedade Brasileira de Matematica. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.
