<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<article lang="en">
<articleinfo>
<title>Swift Provenance Database</title>
</articleinfo>
<section id="_introduction">
<title>Introduction</title>
<simpara>Swift can be configured to gather and store provenance information about script executions. The following tools are available:</simpara>
<orderedlist numeration="arabic">
<listitem>
<simpara>
A set of scripts for extracting provenance information from Swift’s log files. The extracted data is imported into a relational database, currently PostgreSQL, where it can be queried.
</simpara>
</listitem>
<listitem>
<simpara>
A query interface for provenance with a built-in query language called SPQL (Swift Provenance Query Language). SPQL is similar to SQL, except that queries omit the <literal>FROM</literal> clause and the join expressions of the <literal>WHERE</literal> clause; both are computed automatically for the user. A number of functions and stored procedures that abstract common provenance query patterns are available in both SPQL and SQL.
</simpara>
</listitem>
</orderedlist>
<simpara>The tools for managing provenance information in Swift have the following features:</simpara>
<itemizedlist>
<listitem>
<simpara>
Gathering of producer-consumer relationships between data sets and processes.
</simpara>
</listitem>
<listitem>
<simpara>
Gathering of hierarchical relationships between data sets.
</simpara>
</listitem>
<listitem>
<simpara>
Gathering of script source code used in each execution.
</simpara>
</listitem>
<listitem>
<simpara>
Enrichment of provenance records with user-supplied annotations.
</simpara>
</listitem>
<listitem>
<simpara>
Gathering of runtime information about application executions.
</simpara>
</listitem>
<listitem>
<simpara>
A usable query interface for provenance information.
</simpara>
</listitem>
</itemizedlist>
<simpara>A UML diagram of this provenance model is presented in figure <xref linkend="provdb_schema"/>. We simplify the UML notation to abbreviate the information that each annotated entity set (script run, function call, and variable) has one annotation entity set per data type. We define entities that correspond to the Open Provenance Model (OPM) notions of artifact, process, and artifact usage (either being consumed or produced by a process). Annotations, which can be added post-execution, represent information about provenance entities such as object version tags and scientific parameters.</simpara>
<informalfigure id="provdb_schema">
<mediaobject>
<imageobject>
<imagedata fileref="provdb.svg" contentwidth="1280"/>
</imageobject>
<textobject><phrase>Swift provenance database schema</phrase></textobject>
</mediaobject>
</informalfigure>
<simpara><literal>script</literal>: contains the script source code used and its hash value.</simpara>
<simpara><literal>script_run</literal>: refers to the execution (successful or unsuccessful) of a script, with attributes such as start time, source code filename, and Swift’s version.</simpara>
<simpara><literal>function_call</literal>: records calls to functions within a script execution. These calls take as input data sets, such as values stored in primitive variables or files referenced by mapped variables; perform some computation specified in the respective function declaration; and produce data sets as output. In Swift, function calls can represent invocations of external applications, built-in functions, and operators; each function call is associated with the script run that invoked it.</simpara>
<simpara><literal>app_fun_call</literal>: represents an invocation of an application function (<emphasis>app function</emphasis>). In Swift, it is generated by an invocation of an external application. External applications are listed in an application catalog along with the computational resources on which they can be executed.</simpara>
<simpara><literal>application_execution</literal>: represents execution attempts of an external application. Each application function call triggers one or more execution attempts, where one (or, in the case of retries or replication, several) particular computational resource(s) will be selected to actually execute the application.</simpara>
<simpara><literal>runtime_info</literal>: contains information associated with an application execution, such as resource consumption.</simpara>
<simpara><literal>dataset</literal>: represents data sets that were assigned to variables in a Swift script.</simpara>
<simpara><literal>annot</literal>: is a key-value pair associated with either a <literal>variable</literal>, <literal>function_call</literal>, or <literal>script_run</literal>. The annotations are free-form and can be used, for instance, to record scientific-domain parameters, object versions, and user identities.</simpara>
<simpara>The <literal>dataset_in</literal> and <literal>dataset_out</literal> relationships between <literal>function_call</literal> and <literal>variable</literal> define a lineage graph that can be traversed to determine ancestors or descendants of a particular entity. Process dependency and data dependency graphs are derived with transitive queries over these relationships.</simpara>
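<simpara>As a minimal sketch of how such a lineage graph could be exposed as a set of edges, the view below combines the two relationships. The view name <literal>lineage_edge</literal> and the column names <literal>function_call_id</literal> and <literal>dataset_id</literal> are assumptions made for illustration; the actual schema may differ.</simpara>
<screen>-- Hypothetical sketch: view and column names are assumed, not taken from prov-init.sql.
CREATE VIEW lineage_edge AS
  -- A consumed data set is a parent of the function call that reads it.
  SELECT dataset_id AS parent, function_call_id AS child
  FROM   dataset_in
  UNION
  -- A function call is a parent of every data set it produces.
  SELECT function_call_id AS parent, dataset_id AS child
  FROM   dataset_out;</screen>
<simpara>Each row is a directed edge from an entity to its immediate descendant, so ancestors and descendants can be found by following edges transitively in either direction.</simpara>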
</section>
<section id="_design_and_implementation_of_swift_provenance_database">
<title>Design and Implementation of Swift Provenance Database</title>
<simpara>The Swift Provenance Database design is influenced by our survey of provenance queries in many-task computing. The <emphasis>multiple-step relationships</emphasis> (R^*) pattern is implemented by queries that follow the transitive closure of basic provenance relationships, such as data containment hierarchies, and data derivation and consumption. The <emphasis>run correlation</emphasis> (RCr) pattern is implemented by queries for correlating attributes from multiple script runs, such as annotation values or the values of function call parameters.</simpara>
<section id="_provenance_gathering_and_storage">
<title>Provenance Gathering and Storage</title>
<simpara>Swift can be configured to add both prospective and retrospective provenance information to the log file it creates to track the behavior of each script run. The provenance extraction mechanism processes these log files, filters the entries that contain provenance data, and exports this information to a relational SQL database. Each application execution is launched by a wrapper script that sets up the execution environment. We modified these scripts to also gather runtime information, such as memory consumption and processor load. Additionally, one can define a script that generates annotations in the form of key-value pairs, to be executed immediately before the actual application. These annotations can be exported to the provenance database and associated with the respective application execution. Swift Provenance Database processes the data logged by each wrapper to extract both the runtime information and the annotations, storing them in the provenance database. Additional annotations can be generated per script run
using <emphasis>ad-hoc</emphasis> annotator scripts. In addition to retrospective provenance, Swift Provenance Database keeps prospective provenance by recording the Swift script source code, the application catalog, and the site catalog used in each script run.</simpara>
</section>
<section id="_query_interface">
<title>Query Interface</title>
<simpara>During the Third Provenance Challenge, we observed that expressing provenance queries in SQL is often cumbersome. Such queries require extensive use of complex relational joins, which are beyond the level of complexity that most domain scientists are willing, or have the time, to master and write. Such usability barriers are increasingly seen as a critical issue in database management systems. Jagadish et al. propose that ease of use should be a requirement as important as functionality and performance. They observe that, even though general-purpose query languages such as SQL and XQuery allow for the design of powerful queries, they require detailed knowledge of the database schema and rather complex programming to express queries in terms of such schemas. Since databases are often normalized, data is spread across different relations, requiring even more extensive use of join operations when designing queries. Some of the approaches used to improve usability are forms-based query interfaces, visual query builders, and schema summarization.</simpara>
<simpara>An SPQL query is translated into a SQL query that is processed by the underlying relational database. While the syntax of SPQL is by design similar to SQL, it does not require detailed knowledge of the underlying database schema for designing queries, but rather only of the entities in a simpler, higher-level abstract provenance schema, and their respective attributes.</simpara>
<simpara>The basic building block of an SPQL query is a selection query with the following format:</simpara>
<screen>select (distinct) selectClause
(where whereClause
(group by groupByClause
(order by orderByClause)))</screen>
<simpara>This syntax is very similar to a selection query in SQL, with a critical usability benefit: it hides the complexity of designing extensive join expressions. One does not need to list the tables of the <literal>from</literal> clause. Instead, only the entity name is given, and the translator reconstructs the underlying entity that was broken apart to produce the normalized schema. As in the relational data model, every query or built-in function results in a table, which preserves the power of SQL to query the results of another query. Selection queries can be composed using the usual set operations: union, intersect, and difference. A <literal>select</literal> clause is a list with elements of the form <literal>&lt;entity set name&gt;(.&lt;attribute name&gt;)</literal> or <literal>&lt;built-in function name&gt;(.&lt;return attribute name&gt;)</literal>. If attribute names are omitted, the query returns all the existing attributes of the entity set. SPQL supports the same aggregation, grouping, set operation, and ordering constructs provided by SQL.</simpara>
<simpara>To simplify the schema that the user needs to understand to design queries, we used database views to define the higher-level schema presentation shown in Figure. This abstract, OPM-compliant provenance schema is a simplified view of the physical database schema detailed in section. It groups information related to a provenance entity set in a single relation. The annotation entity set shown is the union of the annotation entity sets of the underlying database, presented in Figure. To avoid defining one annotation table per data type, we use dynamic expression evaluation in the SPQL-to-SQL translator to determine the required type-specific annotation table of the underlying provenance database.</simpara>
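<simpara>To illustrate the idea of presenting type-specific annotation tables as a single relation, the following sketch defines a view over two such tables. The view name, the table names <literal>annot_script_run_numeric</literal> and <literal>annot_script_run_text</literal>, and all column names are hypothetical; they are not taken from the actual schema.</simpara>
<screen>-- Hypothetical sketch: every table and column name here is assumed.
CREATE VIEW script_run_annot AS
  -- Numeric annotations are cast to text so both branches share one type.
  SELECT script_run_id, key, CAST(value AS varchar) AS value,
         'numeric' AS value_type
  FROM   annot_script_run_numeric
  UNION ALL
  SELECT script_run_id, key, value,
         'text' AS value_type
  FROM   annot_script_run_text;</screen>
<simpara>A query against such a view does not need to know which type-specific table holds a given key, at the cost of comparing values as text unless they are cast back.</simpara>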
<simpara>SPQL provides the following built-in functions, which abstract common provenance query patterns:</simpara>
<itemizedlist>
<listitem>
<simpara>
<literal>ancestors(object_id)</literal> returns a table with a single column containing the identifiers of variables and function calls that precede a particular node in a provenance graph stored in the database.
</simpara>
</listitem>
<listitem>
<simpara>
<literal>data_dependencies(variable_id)</literal>, related to the previous built-in function, returns the identifiers of variables upon which <literal>variable_id</literal> depends.
</simpara>
</listitem>
<listitem>
<simpara>
<literal>function_call_dependencies(function_call_id)</literal> returns the identifiers of function calls upon which <literal>function_call_id</literal> depends.
</simpara>
</listitem>
<listitem>
<simpara>
<literal>compare_run(list of &lt;function_parameter=string | annotation_key=string&gt;)</literal> shows how process parameters or annotation values vary across the script runs stored in the database.
</simpara>
</listitem>
</itemizedlist>
<simpara>The underlying SQL implementation of the <literal>ancestors</literal> built-in function, below, uses recursive Common Table Expressions, which were introduced in the SQL:1999 standard. It uses the <literal>prov_graph</literal> database view, which is derived from the <literal>dataset_in</literal> and <literal>dataset_out</literal> tables, resulting in a table containing the edges of the provenance graph.</simpara>
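<screen>CREATE FUNCTION ancestors(varchar) RETURNS SETOF varchar AS $$
  WITH RECURSIVE anc(ancestor, descendant) AS
  (
    -- Base case: direct parents of the node passed as $1.
    SELECT parent AS ancestor, child AS descendant
    FROM   prov_graph
    WHERE  child = $1
    -- UNION drops duplicate rows, so the recursion stops once no new edges appear.
    UNION
    -- Recursive step: parents of the ancestors found so far.
    SELECT prov_graph.parent AS ancestor,
           anc.descendant AS descendant
    FROM   anc, prov_graph
    WHERE  anc.ancestor = prov_graph.child
  )
  SELECT ancestor FROM anc
$$ LANGUAGE SQL;</screen>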
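<simpara>The same recursive pattern can be mirrored to walk the provenance graph forward. The following is a sketch only: the function name <literal>descendants</literal> is hypothetical, and it assumes the <literal>prov_graph</literal> view exposes the <literal>parent</literal> and <literal>child</literal> columns used by <literal>ancestors</literal>.</simpara>
<screen>-- Hypothetical counterpart of ancestors(): follows prov_graph edges forward.
CREATE FUNCTION descendants(varchar) RETURNS SETOF varchar AS $$
  WITH RECURSIVE des(ancestor, descendant) AS
  (
    -- Base case: direct children of the node passed as $1.
    SELECT parent AS ancestor, child AS descendant
    FROM   prov_graph
    WHERE  parent = $1
    UNION
    -- Recursive step: children of the descendants found so far.
    SELECT des.ancestor AS ancestor,
           prov_graph.child AS descendant
    FROM   des, prov_graph
    WHERE  des.descendant = prov_graph.parent
  )
  SELECT descendant FROM des
$$ LANGUAGE SQL;</screen>
<simpara>Because <literal>UNION</literal> discards duplicate rows, each descendant is returned once even when it is reachable through several paths.</simpara>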
<simpara>To further simplify query specification, SPQL uses a generic mechanism for computing the <literal>from</literal> clauses and the join expressions of the <literal>where</literal> clause for the target SQL query. The SPQL-to-SQL query translator first scans all the entities present in the SPQL query. A shortest path containing all these entities is computed in the graph defined by the schema of the provenance database. All the entities present in this shortest path are listed in the <literal>from</literal> clause of the target SQL query. The join expressions of the <literal>where</literal> clause of the target query are computed using the edges of the shortest path, where each edge derives an expression that equates the attributes involved in the foreign key constraint of the entities that define the edge. While this automated join computation facilitates query design, it does somewhat reduce the expressivity of SPQL, as one is not able to perform other types of joins, such as self-joins, explicitly. However, many such queries can be expressed using subqueries, which are supported by SPQL. While some of the expressive power of SQL is thus lost, we show in the sections that follow that SPQL can express, with far less effort and complexity, most of the important and useful queries that provenance query patterns require. As a quick taste, this SPQL query returns the identifiers of the script runs that either produced or consumed the file <literal>nr</literal>:</simpara>
<screen>select compare_run(parameter='proteinId').run_id where file.name='nr';</screen>
<simpara>This SPQL query is translated by Swift Provenance Database to the following SQL query:</simpara>
<screen>select compare_run1.run_id
from   (select run_id, j1.value AS proteinId
        from   compare_run_by_param('proteinId') as j1) as compare_run1,
       run, proc, ds_use, ds, file
where  compare_run1.run_id=run.id and ds_use.proc_id=proc.id and
       ds_use.ds_id=ds.id and ds.id=file.id and
       run.id=proc.run_id and file.name='nr';</screen>
<simpara>Further queries are illustrated by example in the next section. We note here that the SPQL query interface also lets the user submit standard SQL statements to query the database.</simpara>
</section>
</section>
<section id="_tutorial">
<title>Tutorial</title>
<simpara>Swift Provenance Database is a set of scripts, SQL functions and stored procedures, and a query interface. It extracts provenance information from Swift’s log files into a relational database. The tools are downloadable through SVN with the command:</simpara>
<screen>svn co https://svn.ci.uchicago.edu/svn/vdl2/provenancedb</screen>
<section id="_database_configuration">
<title>Database Configuration</title>
<simpara>Swift Provenance Database depends on PostgreSQL, version 9.0 or later, due to the use of <emphasis>Common Table Expressions</emphasis> for computing transitive closures of data derivation relationships, supported only on these versions. The file <literal>prov-init.sql</literal> contains the database schema, and the file <literal>pql_functions.sql</literal> contains the function and stored procedure definitions. If the user has not created a provenance database yet, this can be done with the following commands (one may need to add "<literal>-U</literal> <emphasis>username</emphasis>" and "<literal>-h</literal> <emphasis>hostname</emphasis>" before the database name "<literal>provdb</literal>", depending on the database server configuration):</simpara>
<screen>createdb provdb
psql -f prov-init.sql provdb
psql -f pql_functions.sql provdb</screen>
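<simpara>One way to check that the schema was loaded is to list the tables from within a <literal>psql</literal> session connected to <literal>provdb</literal>. This sketch assumes the tables were created in PostgreSQL's default <literal>public</literal> schema, which may not match every installation.</simpara>
<screen>-- Lists the tables visible in the (assumed) public schema of provdb.
SELECT table_name
FROM   information_schema.tables
WHERE  table_schema = 'public'
ORDER  BY table_name;</screen>
<simpara>If the result is empty, the schema file was probably loaded into a different database or schema than the one the session is connected to.</simpara>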
</section>
<section id="_swift_provenance_database_configuration">
<title>Swift Provenance Database Configuration</title>
<simpara>The file <literal>etc/provenance.config</literal> should be edited to define the database configuration. The location of the directory containing the log files should be defined in the variable <literal>LOGREPO</literal>. For instance:</simpara>
<screen>export LOGREPO=~/swift-logs/</screen>
<simpara>The command used for connecting to the database should be defined in the variable <literal>SQLCMD</literal>. For example, to connect to CI’s PostgreSQL database:</simpara>
<screen>export SQLCMD="psql -h db.ci.uchicago.edu -U provdb provdb"</screen>
<simpara>The script <literal>./swift-prov-import-all-logs</literal> will import provenance information from the log files in <literal>$LOGREPO</literal> into the database. The command line option <literal>-rebuild</literal> will initialize the database before importing provenance information.</simpara>
</section>
<section id="_swift_configuration">
<title>Swift Configuration</title>
<simpara>To enable the generation of provenance information in Swift’s log files, set the option <literal>provenance.log</literal> to true in <literal>etc/swift.properties</literal>:</simpara>
<screen>provenance.log=true</screen>
<simpara>If Swift’s SVN revision is 3417 or greater, the following options should be set in <literal>etc/log4j.properties</literal>:</simpara>
<screen>log4j.logger.swift=DEBUG
log4j.logger.org.griphyn.vdl.karajan.lib=DEBUG</screen>
<section id="_enriching_provenance_data_with_runtime_resource_consumption_statistics">
<title>Enriching Provenance Data with Runtime Resource Consumption Statistics</title>
<simpara>A modified version of <literal>_swiftwrap</literal> can be used to gather additional information on runtime resource consumption, such as processor, memory, I/O, and swap use. One should back up the original <literal>_swiftwrap</literal> script and replace it with the modified one:</simpara>
<screen>cp $SWIFT_HOME/libexec/_swiftwrap $SWIFT_HOME/libexec/_swiftwrap-backup
cp swift_mod/_swiftwrap_runtime_snapshots $SWIFT_HOME/libexec/_swiftwrap</screen>
</section>
</section>
<section id="_example_modis">
<title>Example: MODIS</title>
<simpara>Run the MODIS script and import the resulting logs into the provenance database:</simpara>
<screen>swift modis.swift
swift-prov-import-all-logs</screen>
<simpara>Connect to the provenance database:</simpara>
<screen>psql provdb</screen>
<simpara>List runs that were imported to the database:</simpara>
<screen>SELECT script_filename, swift_version, cog_version, final_state, start_time, duration
FROM   script_run;
  script_filename   | swift_version | cog_version | final_state |         start_time         | duration
--------------------+---------------+-------------+-------------+----------------------------+----------
 modis.swift        | 5746          | 3371        | FAIL        | 2012-09-19 17:26:19.221-03 | 2.168
 modis-vortex.swift | 5746          | 3371        | FAIL        | 2012-09-19 17:28:24.809-03 | 180.542
 modis-vortex.swift | 5746          | 3371        | FAIL        | 2012-09-19 17:31:55.706-03 | 312.249</screen>
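<simpara>The same table can be summarized per script. The query below is a sketch that uses only the <literal>script_run</literal> columns shown above and assumes that <literal>duration</literal> is stored as a numeric type.</simpara>
<screen>-- Number of runs and mean duration for each script and final state.
SELECT script_filename, final_state,
       count(*)      AS runs,
       avg(duration) AS avg_duration
FROM   script_run
GROUP  BY script_filename, final_state
ORDER  BY script_filename, final_state;</screen>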
<simpara>List the ancestors of a data set in the provenance graph:</simpara>
<screen>select * from ancestors('dataset:20120919-1731-06svjllb:720000000654');
                         ancestors
-----------------------------------------------------------
modis-vortex-20120919-1731-6fa0kk03:0
modis-vortex-20120919-1731-6fa0kk03:0-6
dataset:20120919-1731-06svjllb:720000000335
dataset:20120919-1731-06svjllb:720000000653
dataset:20120919-1731-06svjllb:720000000007
modis-vortex-20120919-1731-6fa0kk03:06svjllb:720000000335
dataset:20120919-1731-06svjllb:720000000336
dataset:20120919-1731-06svjllb:720000000337
...
dataset:20120919-1731-06svjllb:720000000042
dataset:20120919-1731-06svjllb:720000000229
dataset:20120919-1731-06svjllb:720000000006
(958 rows)</screen>
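<simpara>The result mixes data sets and function calls. To keep only the data sets, the output can be filtered on the identifier prefix. This sketch assumes that every data set identifier starts with <literal>dataset:</literal>, as in the listing above, and relies on PostgreSQL naming the single output column after the function.</simpara>
<screen>-- Keep only the ancestors whose identifier looks like a data set.
SELECT ancestors AS dataset_id
FROM   ancestors('dataset:20120919-1731-06svjllb:720000000654')
WHERE  ancestors LIKE 'dataset:%'
ORDER  BY dataset_id;</screen>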
</section>
</section>
</article>