End-to-End provenance representation for the understandability and reproducibility of scientific experiments using a semantic approach

In this section, we present our main results for the understandability, reproducibility, and reuse of scientific experiments using a provenance-based semantic approach. We first precisely define “Reproducibility” and the related terms used throughout this paper. We then present the REPRODUCE-ME Data Model and ontology for the representation of scientific experiments along with their provenance information.

Definitions

Reproducibility helps scientists build trust and confidence in results. Even though different reproducibility measures are taken in different fields of science, there is no common, globally accepted standard definition of the term. Repeatability and Reproducibility are often used interchangeably even though they are distinct terms. Based on our review of state-of-the-art definitions of reproducibility, and inspired by the definitions in [7, 49], we precisely define the following terms [48], which we use throughout this paper in the context of our research work.

Definition 1

Scientific Experiment: A scientific experiment E is a set of computational steps CS and non-computational steps NCS performed in an order O at a time T by agents A using data D, standardized procedures SP, and settings S in an execution environment EE generating results R to achieve goals G by validating or refuting the hypothesis H.

Definition 2

Computational Step: A computational step CS is a step performed using computational agents or resources such as a computer, software, or a script.

Definition 3

Non-computational Step: A non-computational step NCS is a step performed without using any computational agents or resources.

Definition 4

Reproducibility: A scientific experiment E composed of computational steps CS and non-computational steps NCS performed in an order O at a point in time T by agents A in an execution environment EE with data D and settings S is said to be reproducible if the experiment can be performed to obtain the same or similar (close-by) results to validate or refute the hypothesis H by making variations in the original experiment E. The variations can be made to one or more of the following variables:

Definition 5

Repeatability: A scientific experiment E composed of computational steps CS and non-computational steps NCS performed in an order O at a point in time T by agents A in an execution environment EE with data D and settings S is said to be repeatable if the experiment can be performed under the same conditions as the original experiment E to obtain exactly the same results to validate or refute the hypothesis H. The conditions which must remain unchanged are:

Definition 6

Reuse: A scientific experiment E is said to be reused if the experiment, along with its data D and results R, is used by a possibly different experimenter A′ in a possibly different execution environment EE′ with the same or a different goal G′.

Definition 7

Understandability: A scientific experiment E is said to be understandable when the provenance information (What, When, Where, Who, Which, Why, How) and the metadata used or generated in E are presented such that a possibly different agent A′ can understand the data D and results R of E.

Understandability of scientific experiments is objective as the metadata is defined by the domain-specific community.

The REPRODUCE-ME data model and ontology

The REPRODUCE-ME Data Model [50, 51] is a conceptual data model developed to represent scientific experiments with their provenance information. Through this generic data model, we describe the general elements of scientific experiments for their understandability and reproducibility. We collected provenance information from interviews and discussions with researchers from different disciplines and formulated them in the form of competency questions as described in the Methods section. The REPRODUCE-ME Data Model is extended from PROV-O [23] and P-Plan [33] and inspired by provenance models [15, 16].

Figure 2 presents the overall view of the REPRODUCE-ME data model to represent a scientific experiment.

Fig. 2

The expanded view of the REPRODUCE-ME data model used to represent a scientific experiment

The central concept of the REPRODUCE-ME Data Model is an Experiment. The data model consists of eight main components: Data, Agent, Activity, Plan, Step, Setting, Instrument, and Material.

Definition 8

Experiment is defined as an 8-tuple E = (Data, Agent, Activity, Plan, Step, Setting, Instrument, Material)

where E is the Experiment; Data denotes the set of data used or generated in E; Agent, the set of all people or organizations involved in E; Activity, the set of all activities that occurred in E; Plan, the set of all plans involved in E; Step, the set of steps performed in E; Setting, the set of all settings used in E; Instrument, the set of all devices used in E; and Material, the set of all physical and digital materials used in E.
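As a hedged illustration, such an 8-tuple can be sketched in Turtle with one member of each component set attached to an experiment instance. The repr: prefix and the exact class names below are assumptions based on the component classifications in the text; the authoritative IRIs are in the ontology documentation [52].

```turtle
@prefix repr: <https://w3id.org/reproduceme#> .
@prefix ex:   <http://example.org/lab/> .

# One hypothetical experiment with one member of each component set
ex:exp1      a repr:Experiment .
ex:image1    a repr:Data .
ex:alice     a repr:Experimenter .        # Agent
ex:run1      a repr:Trial .               # Activity
ex:protocol1 a repr:Protocol .            # Plan
ex:step1     a repr:ComputationalStep .   # Step
ex:setting1  a repr:InstrumentSettings .  # Setting
ex:scope1    a repr:Microscope .          # Instrument
ex:plasmid1  a repr:Plasmid .             # Material
```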

We define each of the components of the model in detail. The definitions of the classifications of each component of the model are available in the documentation of the REPRODUCE-ME ontology [52] (see Additional file 1).

Definition 9

Data represents a set of items used or generated in a scientific experiment E.

Data is a key part of a scientific experiment. Though it is a broad concept, we need to narrow it down to specific details to model a scientific experiment for reproducibility or repeatability. Hence, the REPRODUCE-ME data model further categorizes the data. However, different instances of the same data type can belong to different categories. For example, an instance of a Publication from which a method or an algorithm is followed is annotated as Input Data, while another instance of Publication could be annotated as a Result of a study. We model Data as a subtype of Entity defined in the PROV data model. We classify the data as follows:

We use PROV-O classes and properties such as wasDerivedFrom, specializationOf, wasRevisionOf, and hadPrimarySource to describe the provenance of data, especially the transformation and derivation of entities.
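A minimal Turtle sketch of how these PROV-O properties link data entities (the entity names under ex: are hypothetical):

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/lab/> .

# Derivation, revision, specialization, and sourcing between data entities
ex:filteredImage prov:wasDerivedFrom   ex:rawImage .      # transformation
ex:protocol_v2   prov:wasRevisionOf    ex:protocol_v1 .   # versioning
ex:cellCrop      prov:specializationOf ex:rawImage .      # more specific entity
ex:figurePanel   prov:hadPrimarySource ex:rawImage .      # original source
```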

Definition 10

Agent represents the set of people or organizations associated with one or many roles in a scientific experiment E.

Every agent is responsible for one or multiple roles associated with the activities and entities of an experiment. For the reproducibility of scientific experiments, it is important to know the agents and their roles. However, for some experiments or disciplines it is less significant to know all the agents involved. For example, the name of the manufacturer or distributor of a chemical substance or device is important in a life science experiment, while it is less relevant for a computer scientist. Based on our requirements, we present a list of relevant agents [54] that are directly or indirectly associated with a scientific experiment:

Experimenter

Manufacturer

Copyright Holder

Distributor

Author

Principal Investigator

Contact Person

Owner

Organization

Research Project

Research Group

Funding Agency

Definition 11

Activity represents a set of actions, each with a start and end time, that involve the usage or generation of entities in a scientific experiment E.

Activity is mapped to Activity in the PROV-DM model. It represents a set of actions taken to achieve a task. Each execution or trial of an experiment is considered an activity, which is also necessary to understand the derivation of the final output. Here we present the important attributes of the activities of an experiment.

Execution Order: The order of execution plays a key role in some systems and applications. For example, in a Jupyter Notebook, the order of the execution will affect the final and intermediate results because the cells can be executed in any order.

Difference of executions: It represents the variation in inputs and the corresponding change in outputs in two different experiment runs. For example, two executions of a cell in a Jupyter Notebook can provide two different results.

Prospective Provenance: It represents the provenance information of an activity that specifies its plan. e.g., a script.

Retrospective Provenance: It represents the provenance information of what happened when an activity is performed. e.g., version of a library used in a script execution.

Causal Effects: The causal effects of an activity denote the effects of one activity on the outcome of another. e.g., the non-linear execution of cells in a notebook affects its output.

Preconditions: The conditions that must be fulfilled before performing an activity. e.g., software installation prerequisites.

Cell Execution: The execution/run of a cell of a computational notebook is an example of an activity.

Trial: The various tries of an activity. For example, several executions of a script.
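The attributes above can be captured with plain PROV-O terms. As a hedged sketch (entity names and timestamps are hypothetical), two trials of the same script differ only in their retrospective provenance:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/lab/> .

# Two trials of the same script: same plan, different times and outputs
ex:trial1 a prov:Activity ;
    prov:used           ex:script1 ;
    prov:startedAtTime  "2020-01-15T10:02:00Z"^^xsd:dateTime ;
    prov:endedAtTime    "2020-01-15T10:05:30Z"^^xsd:dateTime ;
    prov:generated      ex:output1 .
ex:trial2 a prov:Activity ;
    prov:used           ex:script1 ;
    prov:generated      ex:output2 .
```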

Definition 12

Plan represents a collection of steps and actions to achieve a goal.

The Plan is mapped to Plan in the PROV-DM and P-Plan model. Here, we classify the Plan as follows: (a) Experiment, (b) Protocol, (c) Standard Operating Procedure, (d) Method, (e) Algorithm, (f) Study, (g) Script, (h) Notebook.

Definition 13

Step represents a collection of actions that represent the plan for an activity.

A Step represents a planned execution activity and is mapped to Step in the P-Plan model. Here, we categorize Step as follows: (a) Computational Step, (b) Non-computational Step, (c) Intermediate Step, (d) Final Step.

Definition 14

Setting represents a set of configurations and parameters involved in an experiment.

Here, we categorize the Settings as follows: (a) Execution Environment, (b) Context, (c) Instrument Settings, (d) Computational Tools, (e) Packages, (f) Libraries, (g) Software.

Definition 15

Instrument represents a set of devices used in an experiment.

In our approach, we focused on high-end light imaging microscopy experiments. Therefore, we added microscopy-related terms to include domain semantics. Here, we categorize the Instruments as follows: (a) Microscope, (b) Detector, (c) LightSource, (d) FilterSet, (e) Objective, (f) Dichroic, (g) Laser. This component can easily be extended to instruments from other domains based on the requirements of an experiment.

Definition 16

Material represents a set of physical or digital entities used in an experiment.

We model the Material as a subtype of Entity defined in the PROV data model. Here, we provide some of the materials related to life sciences that are included in the data model: (a) Chemical, (b) Solution, (c) Specimen, (d) Plasmid. This could easily be extended to materials from other domains.

The REPRODUCE-ME ontology

To describe scientific experiments in Linked Data, we developed an ontology based on the REPRODUCE-ME Data Model. The REPRODUCE-ME ontology, which extends PROV-O and P-Plan, models scientific experiments in general, irrespective of their domain. However, it was initially designed to represent scientific experiments in the life sciences, in particular high-end light microscopy experiments [50]. The REPRODUCE-ME ontology is available online along with its documentation [52]. The ontology is also available in the Ontology Lookup Service [55] and BioPortal [53].

Figure 3 shows an excerpt of the REPRODUCE-ME ontology depicting the lifecycle of a scientific experiment. The class Experiment, which represents the scientific experiment conducted to test a hypothesis, is modeled as a Plan. Each experiment consists of various steps and sub-plans, each of which can be either computational or non-computational. We use the object property p-plan:isStepOfPlan to model the relation of a step to its experiment and p-plan:isSubPlanOfPlan to model the relation of a sub-plan to its experiment. The input and output of a step are modeled as p-plan:Variable and are related to the step using the properties p-plan:isInputVarOf and p-plan:isOutputVarOf respectively. The class p-plan:Variable is used to model each data element. For example, Image is an output variable of the Image Acquisition step, which is an integral step in a life science experiment involving microscopy.

The Publication is modeled as ExperimentData, which in turn is a p-plan:Variable and a prov:Entity. Hence, it can be used as an input or output variable depending on whether it was used or generated in an experiment. We use the properties doi, pubmedid, and pmcid to identify publications. The concepts Method, Standard Operating Procedure and Protocol, which are modeled as Plan, are added to describe methods, standard operating procedures and protocols respectively. These concepts are linked to the experiment using the property p-plan:isSubPlanOfPlan. The relationship between a step of an experiment and the method is represented using the object property usedMethod. The concepts ExperimentalMaterial and File are added as subclasses of prov:Entity and p-plan:Variable. A variable is related to an experiment using the object property p-plan:correspondsToVariable. The steps and plans and their input and output variables can be modeled in this manner.
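The step and variable relations described above can be sketched in Turtle as follows (the instance names under ex: are hypothetical; the repr: prefix is assumed):

```turtle
@prefix p-plan: <http://purl.org/net/p-plan#> .
@prefix repr:   <https://w3id.org/reproduceme#> .
@prefix ex:     <http://example.org/lab/> .

ex:exp1        a repr:Experiment .             # an Experiment is a Plan
ex:protocol1   a repr:Protocol ;
    p-plan:isSubPlanOfPlan ex:exp1 .
ex:acquisition a p-plan:Step ;
    p-plan:isStepOfPlan    ex:exp1 .
ex:sample1     a p-plan:Variable ;
    p-plan:isInputVarOf    ex:acquisition .
ex:image1      a p-plan:Variable ;
    p-plan:isOutputVarOf   ex:acquisition .
```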

Fig. 3

A scientific experiment depicted using the REPRODUCE-ME ontology [56]

Instruments and their settings play a significant role in the reproducibility of scientific experiments. The Instrument is modeled as a prov:Entity to represent the set of all instruments or devices used in an experiment. The configurations made to an instrument during the experiment are modeled as Settings. The parts of each Instrument are related to it using the object property hasPart and the inverse property isPartOf. Each instrument and its parts have settings, which are described using the object property hasSetting.
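A hedged Turtle sketch of an instrument, one of its parts, and their settings (the instance names are hypothetical, and the hasPart/isPartOf/hasSetting properties are assumed to live in the repr: namespace):

```turtle
@prefix repr: <https://w3id.org/reproduceme#> .
@prefix ex:   <http://example.org/lab/> .

ex:scope1 a repr:Microscope ;
    repr:hasPart    ex:objective1 ;
    repr:hasSetting ex:scopeSettings1 .
ex:objective1 a repr:Objective ;
    repr:isPartOf   ex:scope1 ;     # inverse of hasPart
    repr:hasSetting ex:objectiveSettings1 .
```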

The agents responsible for an experiment are modeled by reusing the concepts of PROV-O. Based on our requirements to model agents in life-science experiments, we add specialized agents, as defined in the REPRODUCE-ME Data Model, to represent the agents directly or indirectly responsible for an experiment. We use a data property to record the ORCID [57] of the agents of an experiment. We reuse the object and data properties of PROV-O to represent the temporal and spatial properties of a scientific experiment. The object property prov:wasAttributedTo relates the experiment to the responsible agents. The properties prov:generatedAtTime and modifiedAtTime describe the creation and modification times respectively.
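In Turtle, attribution and timing might be recorded as below; the property name repr:orcid and the ORCID value are hypothetical placeholders:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix repr: <https://w3id.org/reproduceme#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/lab/> .

ex:exp1 prov:wasAttributedTo ex:alice ;
    prov:generatedAtTime "2020-01-15T09:00:00Z"^^xsd:dateTime .
ex:alice a repr:Experimenter ;
    repr:orcid "0000-0000-0000-0000" .   # hypothetical ORCID iD
```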

To describe the complete path of a scientific experiment, it is important that the computational provenance is semantically linked with the non-computational provenance. Hence, in the REPRODUCE-ME ontology, we add a semantic description of the provenance of the execution of scripts and computational notebooks [58], which is then linked with the non-computational provenance. We add this provenance information to address the competency question “What is the complete derivation of an output of a script or a computational notebook?”. Therefore, we present the components that we consider important for the reproducibility of scripts and notebooks to answer this question. Table 1 shows the components, their descriptions, and the corresponding terms added to the REPRODUCE-ME ontology to represent the complete derivation of scripts and notebooks. These terms are classified into prospective and retrospective provenance: prospective provenance denotes the specification and the steps required to generate the results, while retrospective provenance denotes what actually happened during the execution of a script. Each term semantically describes the steps, and the sequence of steps, in the execution of a script or notebook in a structured form using Linked Data, independent of the underlying technologies or programming languages.

Table 1 Overview of the ontology terms to model script and computational notebooks provenance

As shown in Table 1, the complete derivation of an output of a script is described by: the function definitions and activations; the script trials; the execution time of each trial (start and end time); the modules used and their versions; the programming language of the script and its version; the operating system where the script is executed and its version; the files accessed during the script execution; the input arguments and return value of each function activation; the order of execution of each function; and the final result.

The provenance of a computational notebook and its executions is depicted using the REPRODUCE-ME ontology in Fig. 4. The Cell is a step of a Notebook, and this relationship is described using p-plan:isStepOfPlan. The Source is related to the Cell using the object property p-plan:hasInputVar, and its value is represented using the property rdf:value. Each execution of a cell is described as a CellExecution, which is modeled as a p-plan:Activity. The input of each execution is a prov:Entity, and the relationship is described using the property prov:used. The output of each execution is a prov:Entity, and the relationship is described using the property prov:generated. The data properties prov:startedAtTime, prov:endedAtTime, and repr:executionTime represent the start time, end time, and total execution time of the cell respectively.

Fig. 4

The semantic representation of a computational notebook [59]
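A hedged Turtle sketch of one cell and one of its executions, using the terms described above (instance names, the cell source text, timestamps, and the xsd:decimal datatype for repr:executionTime are assumptions):

```turtle
@prefix p-plan: <http://purl.org/net/p-plan#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix repr:   <https://w3id.org/reproduceme#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <http://example.org/lab/> .

ex:cell3 a repr:Cell ;
    p-plan:isStepOfPlan ex:notebook1 ;
    p-plan:hasInputVar  ex:source3 .
ex:source3 rdf:value "result = analyse(data)" .
ex:exec7 a repr:CellExecution ;            # a p-plan:Activity
    prov:used           ex:source3 ;
    prov:generated      ex:output7 ;
    prov:startedAtTime  "2020-01-15T10:02:00Z"^^xsd:dateTime ;
    prov:endedAtTime    "2020-01-15T10:02:01Z"^^xsd:dateTime ;
    repr:executionTime  "1.0"^^xsd:decimal .
```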

To sum up, the REPRODUCE-ME ontology describes the non-computational and computational steps and plans used in an experiment, the people involved and their roles, the input and output data, the instruments used and their settings, the execution environment, and the spatial and temporal properties of an experiment, thereby representing the complete path of a scientific experiment.

Evaluation

In this section, we apply the traditional ontology evaluation method by answering the competency questions through the execution of SPARQL queries. All questions mentioned in the Methods section could be answered by running SPARQL queries over the provenance collected in CAESAR [56]. CAESAR (CollAborative Environment for Scientific Analysis with Reproducibility) is a software platform for the end-to-end provenance management of scientific experiments, extended from OMERO [47]. By integrating the rich features provided by OMERO with provenance-based extensions, CAESAR supports the understandability and reproducibility of experiments. It helps scientists describe, preserve, and visualize their experimental data by linking the datasets with the experiments, the execution environment, and the images [56]. It also integrates ProvBook [59], which captures and manages the provenance information of the executions of computational notebooks.

We present here three competency questions with the corresponding SPARQL queries and part of the results obtained by running them against the knowledge base in CAESAR. The knowledge base consists of 44 experiments recorded in 23 projects by the scientists from the CRC ReceptorLight. The total size of the datasets, including experimental metadata and images, amounts to 15 GB. In addition, the knowledge base contains 35 imaging experiments from the IDR datasets [60] and comprises around 5.8 million triples.

Our first question asks for all the steps involved in an experiment that used a particular material; we showcase the answer using a concrete example, namely the steps involving the Plasmid ‘pCherry-RAD54’. The corresponding SPARQL query and part of the results are shown in Fig. 5. As seen from Fig. 5, two experiments (Colocalization of EGFP-RAD51 and EGFP-RAD52 / mCherry-RAD54) use the Plasmid ‘pCherry-RAD54’ in two different steps (‘Preparation’ and ‘Transfection’). The response time for this SPARQL query is 94 ms.

Fig. 5

The steps involved in an experiment which used the Plasmid ‘pCherry-RAD54’
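The exact query is given in Fig. 5 and is not reproduced here; a hedged sketch of a query with this shape, assuming the material is modeled as an input variable of a step and labeled with rdfs:label, might look like:

```sparql
PREFIX p-plan: <http://purl.org/net/p-plan#>
PREFIX repr:   <https://w3id.org/reproduceme#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?experiment ?step
WHERE {
  ?plasmid a repr:Plasmid ;
           rdfs:label "pCherry-RAD54" ;
           p-plan:isInputVarOf ?step .
  ?step    p-plan:isStepOfPlan ?experiment .
}
```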

The SPARQL query answering the competency question ‘What is the complete path taken by a user for a computational notebook experiment?’ and part of the results are shown in Fig. 6. The response time for this SPARQL query is 12 ms.

Fig. 6

Complete path taken by a scientist for a computational notebook experiment: The corresponding SPARQL query and a part of results
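The query used for Fig. 6 is shown there; as a hedged sketch, a query of this shape could traverse the notebook provenance described in the previous section (property names assumed):

```sparql
PREFIX p-plan: <http://purl.org/net/p-plan#>
PREFIX prov:   <http://www.w3.org/ns/prov#>

SELECT ?notebook ?cell ?execution ?output
WHERE {
  ?cell      p-plan:isStepOfPlan ?notebook .
  ?execution p-plan:correspondsToStep ?cell ;
             prov:generated ?output .
}
ORDER BY ?notebook ?cell
```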

The SPARQL query answering the competency question ‘What is the complete path taken by a user for a scientific experiment?’ and part of its results are shown in Fig. 7. This query targets a particular experiment, ‘Focused mitotic chromosome condensation screen using HeLa cells’, with its associated agents and their roles, the plans and steps involved, the input and output of each step, the order of steps, and the instruments and their settings. The results show that this query retrieves all the important elements required to describe the complete path of an experiment. The experiment is linked to both the computational and non-computational steps. The query can be further expanded to retrieve all the elements mentioned in the REPRODUCE-ME Data Model. The response time for this SPARQL query is 824 ms.

Fig. 7

Complete path taken by a scientist for an experiment: The corresponding SPARQL query and a part of results
