Collaborative Research: Citing Structured and Evolving Data


III: Medium: Collaborative Research: Citing Structured and Evolving Data

NSF division: Information and Intelligent Systems, Directorate for Computer & Information Science & Engineering
NSF 12-580
PI: Susan Davidson (CIS, University of Pennsylvania)
co-PI: Peter Buneman (CIS, University of Pennsylvania / School of Informatics, University of Edinburgh)
co-PI: Val Tannen (CIS, University of Pennsylvania)
PI: James Frew (Earth Research Institute, University of California, Santa Barbara)
Senior personnel: Wenfei Fan (CS, School of Informatics, University of Edinburgh)

Project Description

1 Introduction

Citation is perhaps the most fundamental tool of scientific research and, more generally, scholarship. It is what we traditionally have used for verification, and it is essential to the trust we place in scientific work. For better or worse, it is one of the main tools we have for assessing academic reputation. The advent of the internet and our ability to place both academic papers and data on the Web has made it much easier to find and create citations, and has greatly increased the number of citations per paper – however that is measured. One of the problems is that a vast amount of data is published on the Web, but we do not have adequate tools to cite it; especially when the data is in some form of database. The importance of citation to data has been recognized in the large number of organisations [41, 94, 42, 74, 52] which have attempted to describe the structure of data citations. However these tend to follow tradition, and assume that the “digital objects” being cited are fixed and that the structure of the object is irrelevant to the citation. The goal of this research is to understand the problem of citing databases, by which we mean anything that has internal structure or is subject to change. This characterizes a large number, perhaps the majority, of scientific repositories: they all have some internal structure and they nearly all change over time. It is unrealistic to expect a database administrator to manually create all possible citations to a large database, especially if the citations are going to refer to data elements at some fine degree of granularity. It is equally unrealistic to expect that a person wanting to cite some part of a large database should be able to figure out how to do it. We therefore need some computational mechanism for generating citations, and this is what this proposal is about: how to generate citations to databases and how to make the databases themselves citable. 
As far as we are aware, no-one else has tackled this problem. In preparing this proposal, we have discussed the need for citations with a number of people who manage on-line databases and data-sets to find out why database citations are needed and what form they should take.¹ The reasons why they are needed are mostly obvious, but it helps to re-iterate them here.
Retrieval: This is the most basic requirement of a citation: one wants a mechanism for retrieving the cited material.
Reputation: Just as scholars and scientists use citation counts as a means to quantify academic reputation, the creators of valuable data sources should benefit from what is essentially a publication. In addition there are many organizations dedicated to maintaining and publishing data. A citation count is a better justification than a hit-count of a Web-site. In some cases, citation is a “carrot” offered by data centers to convince scientists to publish their data.
Responsibility: Knowing who holds responsibility for, or ownership of, the cited data also enables one to know who is to be consulted on issues of intellectual property or privacy.
Human identification: Persistent object identifiers enable us to retrieve a digital resource, but these tend to be opaque; authorship and title are what we commonly use for identification. In fact, even traditional “persistent identifiers” are opaque: the citations Ann. Phys., 18 639-641 and Nature, 171, 737-738 are perfectly good persistent identifiers, but may not be immediately recognised by the reader – who almost certainly knows of these papers [60, 103].
Repeatability: This is specific to data citation and important to this proposal. If the cited data was used as input to some scientific workflow, one would like the relevant references to the data to be part of the citation and to be machine readable, so that the provenance is explicit and the workflow can be checked, if needed.
We need to design citation mechanisms that satisfy these requirements as well as satisfying the normal database requirements of being efficient and robust under change.
¹ Many thanks to: Micah Altman, Institute for Quantitative Social Science; Kevin Ashley, Digital Curation Centre; Sarah Callaghan, British Atmospheric Data Centre; Peter Burnhill, Edina; Tanvi Desai, London School of Economics; John Kunze, California Digital Library; Mark Liberman, Linguistic Data Consortium, University of Pennsylvania; Nigel Shadbolt, University of Southampton and data.gov.uk.


Figure 1: The Citation Hierarchy (diagram not reproduced; its labels are Persistent References, Citable units and Data references)

Background
We are going to look more closely at the traditional structure of a citation, but we should first note the important distinction between the surface syntax – the presentation – of a citation and its content. Wikipedia [104] lists 14 citation style guides; some 390 styles appear to be available for BibTeX [24]. However, these styles are forms of presentation of some independently defined content [59]. For example, [96] gives an extensive list of fields (content) that should or could be included in a data citation. It also gives an example of the presentation (surface syntax) of that content in XML. This is machine readable so that generating citations in other styles is relatively easy. In principle, the presentation could be made in any of a number of formats (JSON, ASN.1, etc.). What is interesting is that [96] also provides a schema (in XML-Schema) for the specific citation format, and it is this schema that, in part, constrains the content and structure of a citation, but allows additional content and puts increasing responsibility on the authors and publishers of data to decide on what this should be. In all the databases we have looked at, there is a hierarchical structure within which the data is organized. It is either a physical structure or a logical structure that is inherent to the semantics of the data. It is common, for example, to have a (hierarchical) file system in which the files are in the hierarchical formats listed above. Even when the underlying database is relational (as in one of our test databases, IUPHAR-DB), the presentation or logical structure – the organization of Web pages – is hierarchical and the citations naturally follow the presentation hierarchy, not the table/tuple/value hierarchy of a relational database. Given the existence of such a hierarchy, we introduce three notions that are basic to the study of data citations. Persistent References.
Persistent identifiers (Digital Object Identifiers, Archival Resource Keys, Uniform Resource Identifiers, etc.) are intended as persistent mechanisms for locating and retrieving data. They provide a partial mechanism for that most important aspect of a citation, but per se do not provide us with the other content one expects in a citation: authorship, title, date, ownership etc. It is useful to divide the content of a citation into location information, by which we mean information designed to expedite retrieval, and descriptive information. We will use the term persistent reference to describe any node located by a persistent identifier. Citable Units. Single data values – the lowest level in the hierarchy – are not usually regarded as citable per se; similarly, the whole hierarchy of data may not be citable. What is generally regarded as citable is something that – like a publication – has some kind of integrity, defines some kind of context, and for which one can define attributes such as title and authorship. Moreover, just as we can cite a collection of papers as well as the individual papers, so, in a hierarchy, we need to have citations at various levels. We will refer to the nodes that can be cited as citable units. As an example, consider the two citations: “IUPHAR-DB (C1) contains no information about ginandtonicin receptors” and “IUPHAR-DB (C2) asserts that luzindole is an antagonist at MT1.” C1 is to the whole database and would not carry author information. The structure of C2 would be somewhat different; it would contain, for example, a list of contributors responsible for that part of the database. Data references. In conventional citations we often add information to specify location within the citable units. In a citation such as “It is well-attested that the moon is made of green cheese (Bloggs, A.J. The Convolution of Reality. Elspringer (1977) p67)” we understand the citable unit to be that determined by “Bloggs . . .
(1977)” and “p67” to be extra information, which we shall call a data reference, that will help the reader find where this claim is made. More importantly, machine-readable data references are an essential part of provenance tracking, and they require the database to be properly archived. So we now have three concepts: persistent references, citable units and data references, all of which denote nodes in a hierarchy. How are they related? Figure 1 is an informal description of the relationships. First, data references are not themselves citable units but must occur within (underneath) some citable unit. Second, any citable unit should occur underneath a node located through some persistent reference. Should each citable unit have a persistent reference and conversely? There does not seem to be any agreement on this. In a database that changes every hour, it is hard to imagine generating a persistent identifier for each version. We do not judge this, but provide a framework in which citable units are not tied to persistent references.
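The relationships just described can be illustrated with a small sketch. It is not part of the proposed system: the `Node` class, the `citable` and `persistent_id` attributes, and the `resolve` function are hypothetical names chosen for illustration, and the tiny tree only mimics the IUPHAR-DB example used in this proposal. Given any node, the sketch finds the lowest citable unit that dominates it, the nearest persistent reference at or above that unit, and the remaining path, which plays the role of the data reference.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    citable: bool = False                 # hypothetical flag: is this a citable unit?
    persistent_id: Optional[str] = None   # hypothetical: a DOI/ARK, if one exists
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list, repr=False, compare=False)

    def add(self, label, **attrs):
        child = Node(label, parent=self, **attrs)
        self.children.append(child)
        return child


def ancestors_or_self(node):
    """Yield the node, then its parent, and so on up to the root."""
    while node is not None:
        yield node
        node = node.parent


def resolve(node):
    """Return (persistent reference, citable unit, data reference path)."""
    chain = list(ancestors_or_self(node))
    unit = next((n for n in chain if n.citable), None)
    if unit is None:
        raise ValueError("no citable unit dominates this node")
    pref = next((n for n in ancestors_or_self(unit) if n.persistent_id), None)
    below = []                            # labels strictly below the citable unit
    for n in chain:
        if n is unit:
            break
        below.append(n.label)
    return pref, unit, "/".join(reversed(below))


# A toy hierarchy: only the root carries a persistent identifier.
root = Node("Root", citable=True, persistent_id="10.1234")
family = root.add("Calcitonin", citable=True)
receptor = family.add("CALCR", citable=True)
table = receptor.add("Agonist")           # a data reference, not a citable unit

pref, unit, data_ref = resolve(table)
print(pref.label, unit.label, data_ref)
```

Note that the citable unit (the receptor) has no persistent identifier of its own, which reflects the choice made above not to tie citable units to persistent references.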

Proposed Work
The focus of this proposal is to develop a framework for data citation which takes into account the following issues: (1) the potentially very large number of possible citations; (2) the fact that citations should be both human and machine readable; and (3) the need for citations to conform to specifications prescribed by both the publishers of the data and by the various standards that are being established. All these give rise to interesting computational challenges: citations must be generated automatically from the data; the source data must be guaranteed to support the generation of these citations; and the generated citations must be guaranteed to conform to the specifications. Of course, as with any computational problem, all this must be done efficiently. This framework will include: 1) A rule-based language that can operate over a variety of hierarchical structures (including file systems, XML, netCDF, JSON, etc.); 2) Techniques for validating that citation rules are correct, e.g. are unambiguous and uniquely identify data objects; and 3) Techniques for incrementally validating citation rules as the dataset evolves. The research questions to be addressed in this framework are detailed in Section 3. The framework will be implemented in a citation system, the architecture of which is described in Section 4. The research questions and citation system will be informed by, and evaluated with respect to, two particular scientific datasets, IUPHAR-DB and datasets associated with the Earth System Science Server (ES3) project, which are described in Section 2. The evaluation will include usability from the perspective of both authors and citers. Details of the evaluation and the work plan can be found in Section 5.

Previous Related Research by Investigators
Davidson, Tannen, Buneman and Fan have significant expertise in several areas of research related to this proposal. In particular, citation is one form of expressing provenance: Buneman was one of the initiators of the study of data provenance, and has published widely in it as well as data archiving and annotation [35, 34, 26, 29, 27, 81]; Davidson has done extensive research on workflow provenance and privacy issues associated with provenance [39, 14, 45, 98, 15, 46, 10]; Tannen has laid a foundation for database provenance based on semi-rings [71, 64, 78, 4, 3, 5]; and Frew has extensive experience in the application of provenance to Earth science computing. Frew has developed two provenance capture systems [65, 67] and integrated them into Earth science computing environments [57, 58]; he is currently exploring the use of provenance to inform publication decisions [66]. The PIs also have extensive expertise in constraints: Buneman, Davidson and Fan were the initiators of the study of constraints for semistructured data and XML, and have published widely in that field [30, 23, 44]; Tannen has also studied constraints in XML [53, 92, 54, 55, 56]. Fan has since established an extensive research record in constraints for hierarchical data and XML [63, 62], and is an expert on incremental checking of constraints. In particular, Fan has proposed an elegant method for transforming relational databases into XML in such a way that satisfaction of constraints is guaranteed [12, 16]. These ideas have already been used experimentally in IUPHAR-DB, one of the two databases that will be used as test cases for this work. Tannen also has extensive experience with RDF [79, 95, 80, 83]. Of particular importance to this project, Buneman has extensive experience in curated databases.
Together with his students, he has developed techniques for preserving the history of databases that evolve in content and structure [87]; note that it is essential for cited material to be preserved [27]. He was one of the founders of the UK Digital Curation Centre, and is well known in both the database and digital library communities. It was through his interactions with Tony Harmar and the curators of IUPHAR-DB that the issue of data citation was first brought to our attention. In 2006, as a result of this collaboration, Buneman proposed some initial ideas for automatically generating citations [25]. That work did not address the


general problem of constraints, nor did it deal with data references (described in this proposal); but the idea is a starting point for what is described here. Intellectual merit: Although the goal of this proposal is extremely practical, there are several research challenges to be met if we are to achieve a robust and generic solution. These fit well into the mainstream of recent database research as well as opening up some new questions about the structure of linked open data. The research challenges are described in detail in Section 3 and are summarized here. First, the work on checking the correctness of citation specifications (Section 3.2) asks new questions about type-checking XML queries and transformers [62, 86]. Second, the proposed work on incremental checking when schema information is missing or inadequate, extends work on incremental checking [93], but has not been fully explored in the context of key constraints. It also requires us to extend existing database archiving techniques [34]. The problem of generating a suitable citation set for a general query is closely related to the issue of reconciling workflow and data provenance, which has not been completely resolved despite recent useful contributions in that direction [3]. Finally, the problem of imposing structure on RDF so that it is citable is one that might be approached by the use of “named graphs” [102], but it is not at all clear how these proposals yield anything more than a simple partition of the data. Providing a flexible system for extracting and imposing more complex structures on RDF appears to be a new research direction. Broader impact: The proposed research will directly impact two scientific datasets – IUPHAR-DB and the Earth System Science Server (ES3) project. More generally, the ideas of the proposal will have tremendous impact in curated, digital datasets, of which there are an increasing number within science and other application domains. 
More specifically, the proposed work will impact: 1) Scientists who publish their findings in organized data collections or databases; 2) Data centers that are charged with the task of publishing and preserving data for an organization, discipline or other enterprise; 3) Businesses and government agencies that provide on-line reference works, such as encyclopedias, gazetteers, business reports, etc. as part of their business; 4) Standards organizations and the like that are trying to formulate principles for data citations; and 5) Any scientist or scholar who wants to – and should be able to – give attribution to data they find on the Web. In these domains, there is a pressing demand by contributors to have the data they have produced properly cited, and there are numerous specifications for the structure and format of citations. However, there is no general computational approach to solving these problems. For 1-3 we hope to automate the process of generating citations, so that citations are generated automatically from the data, and no intervention is needed if there is any augmentation or modification to the cited data. For 4, there are a number of organizations such as DataCite, SageCite, The Dataverse Network, SigDC, and the Digital Curation Centre (DCC) for which data citation is a key issue. Co-PI Buneman is one of the founders of the DCC and maintains close links with it. We will work with the DCC and other organizations to disseminate results of this research; in particular, we will organize workshops to gain input from those organizations and other stakeholders. The budget includes funds for one such workshop; another may be funded by the Digital Curation Centre at the University of Edinburgh, and they have expressed interest in doing so. The workshops will be used to disseminate the ideas of the proposal and to get feedback from scientists.
The PIs will also incorporate results of this work in graduate level database and bioinformatics courses that they are involved with at the University of Pennsylvania, and will pursue giving other tutorials and workshops in the digital curation/ digital libraries communities.

2 Test Cases

We now describe the two databases which form our test cases, IUPHAR and the ES3 datasets.

2.1 The IUPHAR database

IUPHAR-DB (http://www.iuphar-db.org) is an open access database providing information on medicinal and experimental drugs and their targets in the body. Its aim is to provide an authoritative global resource for students, scientists in industry and academia and for the interested public; the database receives about 10,000 visits from over 100 countries each month. It is regarded as authoritative because it is contributed and peer-reviewed by NC-IUPHAR (the International Union of Basic and Clinical Pharmacology Committee on Receptor Nomenclature and Drug Classification) and its network of over 60 expert subcommittees. IUPHAR-DB synthesizes the work and expertise of hundreds of experts worldwide. The problem of database citation was first brought to the attention of Peter Buneman by the developer of IUPHAR-DB, Antony Harmar, who is Professor of Pharmacology in the University of Edinburgh and Vice-Chairman of the International Union of Basic and Clinical Pharmacology (IUPHAR) Committee on Receptor Nomenclature and Drug Classification (NC-IUPHAR). The need for citation was simple: each section of NC-IUPHAR is compiled by a different set of authors/contributors. Each section is about a specific receptor family: it contains an introduction, which is text, and then a subsection of annotated tabular data for each receptor in the family. The task of writing and assembling the data – as well as keeping it up to date – is substantial, and the contributors deserve recognition for their efforts, just as they would expect it had they published a chapter in a printed reference manual. In fact, it is quite straightforward to convert extracts from IUPHAR-DB into some markup language and print it as a conventional book. One could convert the entire database, but the book would amount to thousands of pages.
Importance of citation generation to IUPHAR. Discussions with Harmar and his colleagues prompted the formulation of a method of specifying and automatically generating citations to curated databases described in [?]. The first point to make is that it would be almost impossible for the curators to create manually every citation that is needed. The second observation was that citations to the whole database, citations to individual receptors and citations to the introductions to receptor families are all different.
Third, the database is frequently updated, and one does not want to check and possibly rewrite the citations each time this happens. Finally, it is highly unlikely that the users of this database would easily figure out how to cite it even if they were given some general specification of citations. The proposal in [25] did not go far enough. It did not deal with location information (described later), little attention was given to efficiency, and no attention was given to incremental checking. Despite this, a partial system was implemented in IUPHAR and can be seen by visiting the Web site http://www.iuphar-db.org. Notably missing from this is any attempt to preserve the cited material. Why IUPHAR is interesting. Although the user interface to IUPHAR-DB is a hierarchically structured set of Web pages, the underlying support comes from a relational database. The individual pages are generated, on the fly, from this database. While the physical structure of many scientific data collections is hierarchical, IUPHAR-DB, along with other curated databases uses a different internal representation. The challenge here is to make the citation system work properly with the relational-to-hierarchical transformation. Of course, the transformation is already specified by the code that creates the web pages, but when it comes to large collections of semistructured data such as RDF, it may be up to the people that specify the citation system to specify the hierarchical (or other) transformation into a citable format. This is one of the long-term challenges of the proposed research.

2.2 ES3 datasets

ES3 is a software system that automatically captures the provenance of arbitrary computational processes by monitoring their system-level interactions [67]. ES3’s approach to provenance capture has been validated by using it to monitor the generation and evolution of multiple Earth science datasets in mixed research/production environments (e.g., [57]). Two particular datasets will be used in the proposed research. Global ocean color dataset. The GSM [84] product suite comprises daily, weekly, and monthly averages of ocean color parameters, at 4 and 9 km spatial resolutions, over the entire globe. For each grid cell, concentrations of chlorophyll, particulates, and dissolved and detrital organic matter are calculated, along with their respective uncertainties. The GSM algorithm calculates the ocean color parameters that best match the observed reflectance spectrum of each MODIS pixel, using a function derived statistically from extensive satellite and field observations. Because the GSM products are generated at fixed spatiotemporal resolutions, their default publication granularity is to map single days, weeks, or months at a given spatial resolution into a single file.² However, many users of the product are only interested in specific subsets of the global ocean, so the data products are also distributed via an OPeNDAP server.³ OPeNDAP [40] is a web service protocol, widely used in the oceanographic community, that allows clients to request arbitrary subsets of multidimensional datasets.
² ftp://ftp.oceancolor.ucsb.edu/pub/org/oceancolor/MEaSUREs/
Alpine snow cover dataset. The MODSCAG [89] products comprise daily observations of fractional-pixel snow cover, at 500 m spatial resolution, over selected alpine regions (e.g., the Sierra Nevada in California and the Hindu Kush in Afghanistan). For each grid cell, the fractions of the cell covered by snow, vegetation, bare soil, and bare rock are calculated, along with an overall RMS error. The MODSCAG algorithm selects the linear combination of library spectra of each of these four components that best matches the observed reflectance spectrum of each MODIS pixel. MODSCAG product generation is more ad-hoc than GSM; partly because the product’s spatiotemporal coverage is more variable (only small discontiguous subsets of the Earth’s surface, and only during times when snow is present); and partly because the MODSCAG algorithm is less mature. The default publication granularity is a single day’s observation over a specific mountain range. Importance of citation to the ES3 datasets. The issue of citation granularity is raised by both products. If the purpose of a citation is to give appropriate intellectual credit, then a single citation to the entire dataset is sufficient. However, there are predictable changes that a dataset-level citation must track. The most common of these is the notion of version, which is a shorthand for a parametric change in the way the dataset is generated. In the case of our two sample datasets, the version tracks both changes in the algorithms, and changes in the source products – the MODIS imagery is periodically recalibrated, itself undergoing a version change. It must also be possible to cite specific granules of a dataset.
In the case of the MODSCAG product, an analysis will typically be performed over a specific region, whose definition is reasonably constant and thus amenable to a persistent citation. For the GSM product, a standard combination of data source, time averaging scheme, date, and ocean color parameter serves to identify individual granules, encoded into each granule name when a granule is retrieved as a file. However, granule-level citation potentially breaks down when the data are retrieved through a service like OPeNDAP that permits arbitrary subsetting. A granule-level citation in this case must include (or refer to) the query that produced the granule. While the GSM product uses a fixed time-averaging scheme (daily, 4-day, 8-day, monthly), the MODSCAG product uses a more sophisticated spatiotemporal interpolation scheme to fill in missing values [58] (e.g., pixels temporarily obscured by clouds). This yields two additional data products: a “half-climatology” time series of values interpolated backwards in time from the present, available immediately, and a full climatology of values interpolated bidirectionally in time over an entire snow season, available only after the season ends. Thus a citation to these products must also track changes in the interpolation method. We have also shown that provenance can be used to help drive the data publication process [66], which has a direct bearing on the requirements for data citation. Why the ES3 datasets are interesting. In contrast to IUPHAR-DB, which is supported by a relational database system, the ES3 datasets are structured as hierarchical files. Since this is a common format for many scientific datasets, using these datasets as a test case will demonstrate the broader applicability of the citation system. The ES3 datasets are also extremely large, much larger than IUPHAR-DB, and will therefore test the scalability of our approach.
For example, while it is possible to scan the entire IUPHAR-DB to test whether a citation key holds, it will not be possible to do this for the ES3 datasets.

3 Research Questions

We now describe the computational challenges to be addressed in developing a framework for data citation in which citations are generated automatically from the data; source data is guaranteed to support the generation of the specified citations; and the generated citations are guaranteed to conform to the specifications. We start in Section 3.1 by describing the research issues surrounding the development of a citation language that can operate over a variety of hierarchical structures (including file systems, XML, netCDF, JSON, etc.). We then describe in Section 3.2 the research issues involved in validating that citation rules are correct, e.g. are unambiguous and uniquely identify data objects, before moving to research issues involved in incrementally validating citation rules as the dataset evolves (Section 3.3). We close by describing how these ideas must be extended to move beyond databases, e.g. citing Web data (Section 3.4).
³ http://dub-oceancolor.eri.ucsb.edu:8080/opendap/


/Root/Version[Number=$v]/Family[FName=$f,Authors=$a*] →
    <resource ...>
      <identifier identifierType="DOI"> ... </identifier>
      <creators> <creator> $a </creator> </creators>
      <title> IUPHAR database (IUPHAR-DB) </title>
      <version> $v </version>
      <description>
        <family> $f </family>
        <receptor> $r </receptor>
      </description>
      ...
    </resource>

Figure 2: An incomplete citation generation rule for DataCite format

/Root/Version[Number=$v,Editor=$e, DOI=$i, Date=$d]
/Data/Family[FamilyName=$f]
/Contributor-list/Contributor=$a]
/Receptor[ReceptorName=$r, Table=$t] →
    { DB: IUPHAR, Version: $v, Family: $f, Receptor: $r, Contributors: $a*, Editor: $e, Date: $d, DOI: $i, Table: $t }

    { DB: 'IUPHAR', Version: 11, Family: Calcitonin, Receptor: CALCR, Contributors: [Debbie Hay, David R. Poyner], Editor: Tony Harmar, Date: Jan, 2006, DOI: 10.1234, Table: Agonist }

Figure 3: A complete rule and an example (JSON) of what it generates

3.1 Citation Generation Language

We are going to exploit the hierarchical organization of the database and use a simple pattern language [25, 37] that operates on hierarchies. In fact, our language is a small subset of XPath that is applicable to any hierarchical data. The model we use is that of an un-ordered tree with labels on the nodes and data at the leaves. We use /t_1/t_2/... for paths, where the t_i are node names (XML tags). Moreover, we allow a condition [p_1 = v_1, p_2 = v_2, ...] on each node, where the p_i are paths and the v_i are values. For example, /master/department[dname='sales']/employee[name/last='Crawley'] identifies all employees in the sales department who have Crawley as a last name (we follow XPath in its abuse of the equality symbol). The language is quite general: it is a fragment of XPath, so we can use it for XML data; and a subset of the language can be used to identify nodes in a Unix-style directory. For example, consider the expression /birddata/rarebirds/osprey/Skye/2010. The first three labels could identify a file, and the last two could be labels within a JSON file. As another example, in NetCDF the expression .../global[institute='BADC'] could be used to locate a data set using a “global attribute”. The idea we exploit is to use this simple XPath language in “reverse”: not as a system for locating nodes, but as a pattern that binds variables when one is given a node. For example, in an employee database one might be given a node and want to know the department name and employee name associated with that node: /master/department[dname=$d]/employee[name/last=$n]. Here the variables bind to data values. We also want to bind variables to tags, e.g., /birddata/rarebirds/$b/Skye/2010. We can now use such a pattern-matching language to generate the data needed by a citation. Suppose, in IUPHAR-DB, we have identified the web page for a particular receptor.
Given a pattern /Root/Version[Number=$v]/Family[FName=$f]/Receptor[RName=$r], the variables v, f, r would bind to the appropriate values, e.g., v=11, f=‘Calcitonin’, r=‘CT’. From such patterns we can now build rules to generate citations. Figure 2 generates output that conforms to DataCite metadata [96], and is close to what the IUPHAR-DB curators want in an IUPHAR-DB citation. For brevity we shall use JSON syntax as in Figure 3, which shows a complete rule together with an example of what would be generated if someone wanted a citation for the agonist table of the Calcitonin CT receptor. (The “*” in the variable $a* is a grouping instruction.)
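To make the “reverse” use of patterns concrete, the following is a minimal Python sketch; it is not the prototype of [37]. It assumes an in-memory tree in which every node keeps a pointer to its parent. The names Node, parse and bind are hypothetical, and each variable binds to a list of values so that cardinality can be checked separately.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """Unordered tree node: a label, an optional leaf value, and children."""
    label: str
    value: Optional[str] = None
    children: list = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, label, value=None):
        child = Node(label, value, parent=self)
        self.children.append(child)
        return child


# One step of a pattern: /tag or /tag[path=value, path=value, ...]
STEP = re.compile(r"/([^/\[\]]+)(?:\[([^\]]*)\])?")


def parse(pattern):
    """Turn a pattern string into a list of (tag, [(path, value), ...]) steps."""
    steps = []
    for tag, conds in STEP.findall(pattern):
        pairs = []
        for cond in filter(None, conds.split(",")):
            path, _, val = cond.partition("=")
            pairs.append((path.strip().split("/"), val.strip()))
        steps.append((tag.strip(), pairs))
    return steps


def values_at(node, path):
    """Leaf values reached from node by following the labels in path."""
    frontier = [node]
    for label in path:
        frontier = [c for n in frontier for c in n.children if c.label == label]
    return [n.value for n in frontier if n.value is not None]


def bind(pattern, node):
    """Match pattern against the root-to-node chain; return bindings or None."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()
    steps = parse(pattern)
    if len(chain) < len(steps):
        return None
    env = {}
    for (tag, conds), n in zip(steps, chain):
        if tag.startswith("$"):          # variable bound to a tag
            env[tag[1:]] = [n.label]
        elif tag != n.label:
            return None
        for path, val in conds:
            found = values_at(n, path)
            if val.startswith("$"):      # variable bound to data values
                if not found:
                    return None
                env[val[1:]] = found
            elif val.strip("'\"") not in found:
                return None
    return env


# A toy fragment shaped like the IUPHAR-DB example in the text.
root = Node("Root")
version = root.add("Version")
version.add("Number", "11")
family = version.add("Family")
family.add("FName", "Calcitonin")
receptor = family.add("Receptor")
receptor.add("RName", "CT")

bindings = bind("/Root/Version[Number=$v]/Family[FName=$f]/Receptor[RName=$r]", receptor)
```

Here the pattern is applied to the receptor node rather than used to search for it: the matcher climbs to the root and reads the bindings off the ancestors. A pattern with a literal condition that does not hold, or a pattern longer than the chain, yields None.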


A citation system is simply a set of such rules. Given a node in the hierarchy, one picks the rule that selects the lowest dominating citable unit and uses that rule to generate the citation. We have already implemented [37] a simple prototype of such a language, but it is limited in several ways. The first is that it is only capable of matching against XML. This brings up our first task.

Task 1. Implement a matching language that will operate over a hierarchical interface, and implement interfaces for IUPHAR and ES3 and for a variety of hierarchical structures including file systems, XML, netCDF, JSON, “home-grown” formats (e.g. fasta) and relational databases. This involves: 1(a) Designing and implementing the right primitives for access (a combination of indexing and DOM-like traversal primitives). 1(b) Implementing these primitives for a variety of hierarchical structures including file systems, XML, netCDF and JSON, and implementing them through hierarchical views of relational databases. It is easy to implement them for the /table/tuple/value hierarchy of a relational database, but in the case of IUPHAR-DB, the hierarchical HTML presentation is generated, on demand, through a set of database queries. We expect to use Fan’s data publishing [11] software for this purpose.
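As an illustration of what the access primitives of Task 1 might look like, here is a small Python sketch. It is an assumption on our part, not a design decision: the class names Hierarchy, JsonHierarchy and RelationalHierarchy are hypothetical, the two primitives are reduced to children and value, and the relational wrapper exposes the simple /table/tuple/value view using SQLite row ids as tuple labels.

```python
import sqlite3
from abc import ABC, abstractmethod


class Hierarchy(ABC):
    """Hypothetical minimal interface: paths are tuples of labels."""

    @abstractmethod
    def children(self, path):
        """Labels of the child nodes below path."""

    @abstractmethod
    def value(self, path):
        """Data at a leaf, or None for an inner node."""


class JsonHierarchy(Hierarchy):
    """Wraps nested dictionaries such as a parsed JSON document."""

    def __init__(self, doc):
        self.doc = doc

    def _get(self, path):
        node = self.doc
        for label in path:
            node = node[label]
        return node

    def children(self, path):
        node = self._get(path)
        return sorted(node) if isinstance(node, dict) else []

    def value(self, path):
        node = self._get(path)
        return None if isinstance(node, dict) else node


class RelationalHierarchy(Hierarchy):
    """Presents a SQLite database as /table/rowid/column."""

    def __init__(self, conn):
        self.conn = conn

    def _tables(self):
        rows = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
        return [r[0] for r in rows]

    def _columns(self, table):
        cursor = self.conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
        return [d[0] for d in cursor.description]

    def children(self, path):
        if len(path) == 0:
            return self._tables()
        table = path[0]
        if table not in self._tables():      # also guards the quoted name
            raise KeyError(table)
        if len(path) == 1:
            rows = self.conn.execute(f'SELECT rowid FROM "{table}" ORDER BY rowid')
            return [str(r[0]) for r in rows]
        if len(path) == 2:
            return self._columns(table)
        return []

    def value(self, path):
        if len(path) != 3:
            return None
        table, rowid, column = path
        if table not in self._tables() or column not in self._columns(table):
            raise KeyError(path)
        row = self.conn.execute(
            f'SELECT "{column}" FROM "{table}" WHERE rowid = ?', (int(rowid),)
        ).fetchone()
        return None if row is None else row[0]


def walk(h, path=()):
    """Generic traversal that works over any Hierarchy implementation."""
    yield path, h.value(path)
    for label in h.children(path):
        yield from walk(h, path + (label,))


def leaves(h):
    return [(p, v) for p, v in walk(h) if v is not None]
```

The point of the sketch is that walk and leaves are written once against the interface; a matcher built on the same two calls would not need to know whether it is reading JSON or a relational table. A view such as the one IUPHAR-DB generates through queries would need a richer wrapper than the one shown.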

3.2

Constraint Specification Language

Constraints arise in various ways. In Figure 2 the XML generated is constrained by the DataCite XML Schema [96], which also imposes cardinality constraints on some fields. As an example of what is required, Figure 3 cites a specific table in IUPHAR-DB. The first constraint is that we require the location of the data reference to be uniquely specified by certain variables. That is, the bindings $v=11, $f=’Calcitonin’, $r=’CALCR’ and $t=’Agonist’ will, in the absence of bindings for any other variables, uniquely specify a node in the hierarchy: v, f, r give the location of the citable unit and t gives the data reference. In addition to this, the persistent identifier i should bind to a unique value and will (in this case) resolve to the root of the hierarchy. These are location constraints on pattern matching that must be satisfied independently of any other constraints. In addition there are constraints on the descriptive fields. For example, we will require that the date d binds to exactly one value, and we may require that there is at least one author a and that there are zero or more editors s. In addition to cardinality constraints, we need to ensure that, in the case of Figure 2, the output is valid with respect to the XML Schema. For example, there is a resourceTypeGeneral field that is restricted to range over a set of values such as dataset, collection, text etc., but – interestingly – does not include database. Can we design a satisfactory constraint specification language? In [25] a very simple, but inadequate, system was proposed which involved annotating the variables. We believe a system can be described based on the well-understood area of complex object types [2].

Task 2. Checking the validity of citations. 2(a) Design of a simple constraint language that will express uniqueness and cardinality constraints. It is an open question as to whether this constraint language can be expressed as a decoration of our simple pattern-matching language.
This language should express structural constraints that will guarantee conformance to well understood metadata schemas such as described in [96]. 2(b) Central to any constraint language are the classical satisfiability and implication problems. Given a set of constraints and a set of citation rules, the satisfiability problem is to determine whether there exists a database at all that satisfies the constraints and the citation rules, i.e., whether these constraints and rules make sense when put together. The implication problem is to decide whether the constraints and rules entail another constraint as a logical consequence. These problems are not only of theoretical interest, but are also important to validating and optimizing constraints and citation rules. 2(c) Constraint checking in the absence of a source schema. Constraints apply to the citations that are generated. Suppose that there is no specified structure on the database to be cited. For example, it is unconstrained XML or a directory structure in which the files all contain JSON. How can we most efficiently traverse the hierarchy and check that the constraints on the citation are satisfied? We may have several citation rules, and at minimum, we would like to exploit any commonality and verify them together. We would also want to make use of any indexing that is available. 2(d) Checking in the presence of schema. Given, e.g., an XML Schema for the source, can we statically verify that a set of rules will generate valid citations (according to the constraints placed on citations)? This is closely related to type-checking for XML queries [62, 63]. These issues are, however, already nontrivial due to the interaction between constraints and citation rules, and are more intriguing in the presence of source schema. Indeed, it is known that even for unary (relative) XML keys, these problems are already undecidable in the presence of simple DTDs [6]. It is also possible that the source database is not sufficiently constrained by its schema (or the schema is missing), and one needs to query the database itself to check that a specification is valid. For example, the citation specification may assume a hierarchical key constraint, but no such thing is mentioned in the schema. The only way to check that the constraint is valid is to look at the data. Of course, such a check needs to be repeated on any update to the database, and this is one of the reasons for the following section on incremental checking.
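The dynamic side of Task 2, checking by looking at the data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: matching is assumed to yield, for each variable, the list of values it binds to; the constraint table CARDINALITY and both function names are hypothetical; and the variable names d, a, s, v, f, r, t follow the example in the text.

```python
import math
from collections import defaultdict

# Hypothetical cardinality constraints: variable -> (minimum, maximum).
CARDINALITY = {
    "d": (1, 1),          # exactly one date
    "a": (1, math.inf),   # at least one author
    "s": (0, math.inf),   # zero or more editors
}


def check_cardinality(bindings, constraints):
    """Return a list of violation messages for one set of bindings."""
    errors = []
    for var, (lo, hi) in constraints.items():
        n = len(bindings.get(var, []))
        if not lo <= n <= hi:
            errors.append(f"${var}: expected between {lo} and {hi} values, found {n}")
    return errors


def check_location_key(matches, key_vars):
    """Find key values that identify more than one node.

    matches is an iterable of (node_id, bindings) pairs, one per node at
    which the rule matched. The result maps each offending key to the
    sorted list of nodes that share it; an empty result means the
    location variables act as a key.
    """
    seen = defaultdict(set)
    for node_id, bindings in matches:
        key = tuple(tuple(bindings.get(v, [])) for v in key_vars)
        seen[key].add(node_id)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}
```

Note that check_location_key has to see every match before it can answer, which is exactly the full traversal that a source schema with a declared key would let us avoid.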

3.3

Managing Change in Data and Schema

Collections of data are seldom static. In curated databases one may expect 10% of the data to be modified in a year [34]. Checking the validity of a citation system, even if it is something that can be done relatively efficiently, is not something that one wants to do on every update. In the example in Figure 2, if a new receptor were added or an existing one modified, one would hope that one would only need to validate the citation system on the data for that receptor, but on what basis do we make that judgement? It is a trivial matter to design a rule with a uniqueness constraint that requires (in the absence of indexing) a complete traversal of the database in order to revalidate after a single update. Fortunately, hierarchically structured data permits efficient archiving [34], and since we need to preserve old versions in order that citations can always be resolved, we should be able to combine citation validation with archiving. Task 3. This concerns the change in databases. 3(a) Implement a generic version of our existing archiving system [87] to work with the primitives described in Task 1 and extend the pattern matching language to work on the archived data. This is straightforward for data whose physical organisation is hierarchical or for data (as in IUPHAR-DB) with a logical hierarchy that is easily extracted. 3(b) Study incremental validation in both the absence and presence of a source schema. These require us to study incremental static analyses of citation rules and constraints, such as the satisfiability and implication analyses. The challenge again arises from the interaction between citation rules and constraints. Worse still, as remarked earlier, citation rules and constraints also interact with structural constraints imposed by a source schema when it is present. 3(c) Implement appropriate algorithms for schema-less databases. 
We want to develop bounded incremental validation algorithms whenever possible, for which the cost can be expressed as a function of the size of changes in the input and output, rather than the size of the entire input (database) [93]. The need for bounded incremental validation algorithms is evident: while databases are updated frequently, their changes are typically small, and as a result, so are the changes to the output. An incremental validation algorithm incurs only the updating costs that are inherent to the incremental problem itself, and is typically far more efficient than unbounded algorithms whose costs are dependent on the size of the database, especially when the database is large.
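For one simple kind of constraint, key uniqueness, a bounded incremental check can be sketched as follows. This is a minimal Python illustration and not the algorithm we propose to develop: it assumes that the checker retains an index of key counts between runs, and the class name IncrementalKeyChecker and its methods are hypothetical.

```python
from collections import Counter


class IncrementalKeyChecker:
    """Maintains key counts so that each update costs time proportional
    to the number of changed keys, not to the size of the database."""

    def __init__(self):
        self.counts = Counter()    # key -> number of nodes carrying it
        self.violations = set()    # keys currently carried by more than one node

    def _touch(self, key, delta):
        self.counts[key] += delta
        if self.counts[key] > 1:
            self.violations.add(key)
        else:
            self.violations.discard(key)
            if self.counts[key] <= 0:
                del self.counts[key]

    def apply(self, inserted=(), deleted=()):
        """Apply one update and report whether the key constraint holds."""
        for key in deleted:
            self._touch(key, -1)
        for key in inserted:
            self._touch(key, +1)
        return not self.violations


def full_check(keys):
    """Baseline that rescans every key; cost grows with the database."""
    return all(n == 1 for n in Counter(keys).values())
```

The retained counts play the role of a remembered trace of the previous check. Constraints that interact with citation rules or with a source schema will not reduce to a single index in this way, which is where the research questions of Task 3 lie.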

3.4

Beyond Databases: Citing Web Data

The success of the project will be determined by the simplicity, efficiency and generality of the above methods, and these will in turn depend largely on the research into constraints. However, we are confident that the first three tasks are all feasible, and that we will be able to implement a working citation system for IUPHAR-DB as well as the ES3 datasets that can serve as an exemplar for other curated databases and data collections; there will also be a transferable body of code (see Section 4). Nevertheless, there is one vast data set for which the techniques we have so far described will have to be further developed. This is the Semantic Web, or more precisely RDF. Ostensibly it is a very large and “flat” structure of triples, which is devoid of any hierarchical structure. There is no immediate notion of a citable unit; concepts of authorship, responsibility and currency – essential for citation – have no standardized representation, and, beyond URIs, which are basic values in this structure, there is no higher concept of a persistent identifier. Our goal is to organize it in such a way that we could apply the traditional methods of citation. It is interesting that the structure of IUPHAR-DB gives an idea of how one might go about this. Although IUPHAR-DB is seen through a browser as a hierarchical structure, that hierarchy is generated, on the fly, from a database. That is, the citable units, and more generally, each node in the hierarchy corresponds to subsets of tuples in a relational database. The subset is extracted through a simple relational query and the hierarchical relationship corresponds to containment of these queries. Let us try to carry the idea into triple stores. Suppose that there is enough information to generate the information (authorship, date, title, etc.) required for citations in some RDF corpus. Some of this information may be available from the “names” of the collections involved, the fourth column, which is generally available in a triple store. Our strategy will be to define virtual nodes corresponding to persistent references, citable units and data references through queries in SPARQL that return a subset of triples. This yields a hierarchical structure defined by query containment. We also envision creating additional triples and exploiting Semantic Web mechanisms in order to create persistent identifiers as needed. Fulfilling this strategy raises interesting challenges. Clearly full SPARQL is too complex for this task. The baseline for our investigation will be a subset of SPARQL for which checkers can statically verify that a set of virtual nodes defined by a set of queries forms, in fact, a hierarchy. This requires efficient decidability of query containment and of query disjointness (empty intersection). At a later stage, we will question whether SPARQL is the appropriate context for a hierarchy-defining set of queries and will investigate the possibility of using alternative language specifications.
In order to fit this approach into the framework we propose (see Section 4), we will also need to define a semantics for citation rules when applied to hierarchies defined by sets of queries. We will also need to investigate the principles behind accessing RDF data through such a hierarchy, that is, the foundations of the citation interfaces discussed in Section 4. These investigations may also be useful beyond the issue of RDF data citation, for instance for RDF data entry and RDF data update. Whether or not this is a fruitful approach to citation for Web data will require us to look more closely at a range of examples, notably government data [88] to see if the relevant information is present. There are also questions about the stability of these data sets that must be addressed. We intend to investigate annotation and provenance for RDF in order to suggest how named graphs should be structured in order to enable citation. Task 4. Develop the foundations of a citation framework for RDF data. In particular 4(a) design a language for specifying hierarchies of RDF data by query containment; 4(b) develop a translator from such hierarchy specifications to citation rules (see Section 3.1); 4(c) examine the practicability and usefulness of this approach on benchmarks such as [88]; and 4(d) propose further additions to RDF and surrounding standards that are needed for citations for Web data.
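The hierarchy condition itself can be illustrated on a concrete set of triples. The Python sketch below is a dynamic, instance-level check and therefore much weaker than the static containment and disjointness analysis described above. Everything in it is an assumption made for illustration: the triples are invented, plain Python predicates stand in for SPARQL queries, and the function names are hypothetical.

```python
# Invented example triples (subject, predicate, object).
TRIPLES = {
    ("ex:CALCR", "ex:inFamily", "ex:Calcitonin"),
    ("ex:CALCR", "ex:hasLigand", "ex:ligandX"),
    ("ex:CALCRL", "ex:inFamily", "ex:Calcitonin"),
    ("ex:GLP1R", "ex:inFamily", "ex:Glucagon"),
}


def members(family):
    """Subjects declared to belong to the given family."""
    return {s for s, p, o in TRIPLES if p == "ex:inFamily" and o == family}


# Each virtual node is defined by a predicate standing in for a query.
QUERIES = {
    "db": lambda t: True,
    "family/Calcitonin": lambda t: t[0] in members("ex:Calcitonin"),
    "receptor/CALCR": lambda t: t[0] == "ex:CALCR",
}


def evaluate(queries, triples):
    """Materialise the set of triples selected by each virtual node."""
    return {name: frozenset(t for t in triples if q(t)) for name, q in queries.items()}


def is_hierarchy(node_sets):
    """True if every pair of node sets is nested or disjoint."""
    names = list(node_sets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            x, y = node_sets[a], node_sets[b]
            if not (x <= y or y <= x or x.isdisjoint(y)):
                return False
    return True


def parents(node_sets):
    """Parent of a node = its smallest strict superset, or None for a root."""
    result = {}
    for name, s in node_sets.items():
        supersets = [(len(t), other) for other, t in node_sets.items()
                     if other != name and s < t]
        result[name] = min(supersets)[1] if supersets else None
    return result


node_sets = evaluate(QUERIES, TRIPLES)
```

On this data the three virtual nodes nest as database, family, receptor. Adding a node that selects all ex:inFamily triples would overlap the family node without being contained in it, and is_hierarchy would reject the set; a static checker would have to reach the same verdict from the queries alone.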

3.5

Extended uses of citations

This proposal is predicated on what we believe is a safe assumption – that the need for citations will not disappear. In fact, there are also emerging ideas that may cause us to re-think the way in which citations are used, for example the issue of reproducibility of scientific experiments [?]. This has led to substantial work on workflow provenance [45] and to the notion of an “executable paper” – one that is both readable and executable in that it can be used to re-run the code that generated the results [91]. In this case we presumably need “executable citations”, which, like [24], would be understandable to a unix programmer. Related to the idea of executable citation is the notion of “microattribution” [68]. Suppose we have a query that grabs data from a large number of data sources in different citable units. Can we, from the query, easily construct all the relevant citations? This is closely related to what we intend to investigate for RDF. These departures may or may not become widely adopted, but if they do, they will only reinforce the need for the research we have outlined.


[Figure 4 diagram. Recoverable labels: Preproc. Developer, Preproc. Module, DB Schema, Citation Source Schema, Citation Interface, Citation Designer, Citation Rules, Citation Target Schema, Generator, Citation Engine, install, DB, Citer, Query, Citation(s).]

Figure 4: System architecture

[Figure 5 diagram. Recoverable labels: Citation Source Schema, DB, Citation Interface, Checker, Citation Designer, Citation Rules, Citation Target Schema, update, Diagnostics.]

Figure 5: Checking

4

Architecture

The ideas of this proposal will be implemented in a citation system, which will be tested in IUPHAR-DB and the ES3 datasets and made generally available over the web (see Section 7). We describe in this section the system architecture. The citation system (see Figure 4) involves three human actors: the preprocessing developer, who uses the preprocessing module to create an interface to the existing database/file system that the citation generator can use; the citation designer/administrator, who specifies how the database is to be cited using the generator and checker; and the citer, i.e., the scientist who walks up to the database and extracts something from it using the citation engine. The process of developing a (specialized) citation interface consists of three phases:

Preprocessing Phase. A developer creates (offline) a software module that takes as input the existing database schema or other description, and produces a citation source schema and a citation interface. The first is a hierarchical description with a hierarchical key specification and a complex object type. The second is a wrapper that provides a stripped-down DOM- or SAX-like interface that will treat the database as a hierarchy and provide calls for traversing that hierarchy. Depending on the structure of the database, this is trivial (e.g. for XML), easy (e.g. for hierarchical file structures or data formats), or harder. For example, for relational databases this will require more advanced expertise. In the case of IUPHAR-DB, a good candidate for this is the PRATA system [11, 38] developed by Fan, which provides a means to publish a DTD-conscious XML interface to a relational database.

Citation Design Phase. The citation designer (who could very well be the database administrator) writes an initial version of the requirements: citation rules and constraints as described in Sections 3.1 and 3.2, and a citation target schema that may be a standard such as DataCite [41]. There are then two sub-phases:

Checking. There may be no guarantee that it is possible to implement a citation engine that conforms to the requirements; hence the checker, which takes as input the database and the products of the previous phase (Figure 5). For example, we may have to check key specifications against the database as explained in Section 3.2. The checker’s diagnostics are used by the citation designer to update the requirements. Multiple iterations may be needed to arrive at a satisfactory specification.

Citation Engine Generation. At this point, there is enough information for the citation engine generator (Figure 4) to do its work and produce the software engine that outputs actual citations. We note that the generator and the checker are likely to share modules since data structures created by the checker will often be useful to the generator. We also note that while the generator does not need to access the database, it will incorporate the citation interface into the citation engine that it produces so that the engine can access the database in the citation phase.

Citation Phase. In this phase, the citer activates the citation engine in order to extract one or more citations. Taking as input a query, the citation engine uses its built-in citation interface to access the database (Figure 4). There are several forms the “query” can take:
• A unix-style directory path by which the citer obtained the data. Many scientific data sets are just file systems (or mixtures of file systems and hierarchical data formats).
• A URL that was used when the citer found the data.
• A path that is given to us by the existing user interface to the database. This is the situation with IUPHAR and requires some (not much) work by the people who implement the interface.
• A database query in SQL, XQuery or SPARQL. This is the most interesting case, and one that may require the generation of multiple or “micro” citations.

When the database changes. As indicated in Section 3.3 there may be a need to re-do the second phase (re-check and perhaps re-generate) on database update, especially when the source schema is missing or inadequate. In many curated databases, the times between updates are measured in months, but there are some for which the updates are more frequent, and for which their size prohibits traversing the whole database to check, say, the satisfaction of a key constraint. For such situations we envision an incremental version of the checker that remembers the trace of the previous check and determines whether or not additional checking is needed as a result of the updates.

As can be summarized from the architectural diagrams, we propose the following deliverables:
• D1: Preprocessing modules: this project will deliver modules for XML, file systems and possibly file systems combined with certain data formats. For other systems (e.g. IUPHAR) we will work with their programmers to deliver a prototype.
• D2: (Citation engine) Generator.
• D3: Checkers (static and incremental).
• D4: Specification languages for the inputs and intermediate products.

It may be useful to summarize which of the modules above are not generic. That is, what work would be needed to make a new database – other than IUPHAR and ES3 – citable? This is simply the citation schema and citation interface. In addition, when the database has (as with IUPHAR) a user interface, the programmers of that interface will need to call upon the citation engine to provide citations that are readily accessible as part of the interface.

5

Evaluation and Milestones

Evaluation: The citation system will be tested on two different datasets, selected due to their significant differences: relational versus hierarchical file system, small versus extremely large. In these settings, the system will be evaluated along several dimensions:

1. Setup Difficulty. To use the citation system within the IUPHAR and ES3 systems, developers at those sites will need to create wrapper interfaces. In the case of IUPHAR (since we need to add archived versions) the two tasks can be combined through the use of PRATA [38]. In the case of ES3, a relatively “thin” wrapper akin to the calls that are available in most programming languages to traverse unix-style directories should suffice for citation. However, for location information, wrappers for the internal data formats will have to be constructed.
2. Checker. Here the issues are mostly those of efficiency, and whether or not the checkers can adapt to change.
3. Scalability. How well does this technique scale to large datasets? We are confident that the techniques will work on IUPHAR, but ES3, while much simpler in overall structure, presents an efficiency challenge if we are required to provide data references at a very fine level of granularity.
4. Impact. Are citations being used? This can be tested within each system by measuring the number of times users click on the “Cite Me” button embedded in the data interface. Clicking on the “Cite Me” button will generate references in standard formats, e.g. BibTeX or EndNote. We can also measure impact by determining how many of these end up in Google Scholar.

At an early stage of the project, through the planned workshops, we expect to get a better idea of the generality of these techniques. We have deliberately chosen two databases that we believe to be at “extremes” of the citation spectrum, but this needs confirmation. We have briefly looked at databases proposed by Dr. Micah Altman (Social Sciences) and Dr. Sarah Callaghan (UK atmospheric and geological data) and believe that they fall within this range. As mentioned in the introduction, there is a large movement to encourage data citation, which involves a change in attitude by scientists and scholars to their use of data. We do not know whether this will happen, but we can be sure that without tools such as the ones we have described, it will not happen.

Work Plan: The work plan consists of research and development across three sites: Penn, UCSB and University of Edinburgh. In the Gantt chart, the Edinburgh contributions are represented in grey because they are being funded by a separate EU project, DIACHRON, which is targeted at the preservation of all kinds of data. The archiving and set-up modules are similar for the two projects, and the additional effort will be provided by the IUPHAR staff. In the case of the ES3 data sets, which already have a hierarchical structure and a prototype archiving system, this part of the project is much simpler. In the Gantt chart each task is represented with the initials of the people involved. The first name is the leader for this task. SBD, Susan Davidson; VT, Val Tannen; PB, Peter Buneman; WF, Wenfei Fan; JF, James Frew; SBRA, Santa Barbara Research Assistant; PRA, Penn Research Assistant; ERA, Edinburgh Research Assistant (EU project, DIACHRON: separate funding). Two workshops are planned, one funded by this project and one organized and funded at the University of Edinburgh. The Digital Curation Centre (DCC) has expressed an interest in supporting this workshop, and funding should be available from them or from the DIACHRON project mentioned above.


Development Task                            Personnel
Language development                        PB, SBD, WF, JF, VT, PRA
Constraint checking and Optimization        PB, SBD, WF, VT, PRA
Incremental Checking                        WF, VT
Citation of Semi-structured Data and RDF    VT, PB
Citation Schema Specification               SBD, PB, JF, VT, WF
Citation Interface Design                   VT, PB, SBD, JF
Microcitations, workflows, etc              SBD, PB, JF, ERA
Citation Engine Generator                   VT, PB, PRA, SBRA, ERA
IUPHAR Archiving                            PB, ERA
IUPHAR Setup                                PB, ERA
IUPHAR Interfaces and Testing               PB, ERA
ES3 Archiving                               JF, SBRA
ES3 Setup                                   JF, SBRA
ES3 Setup and Testing                       JF, SBRA
US Workshop                                 SBD
UK Workshop                                 PB

[Gantt chart; timeline axis in months: 6, 12, 18, 24, 30, 36]

6

Results of Prior NSF Research

Award ID 0513778: II: Data Cooperatives: Rapid and Incremental Data Sharing with Applications to Bioinformatics
Susan B. Davidson: PI; Zack Ives and Val Tannen, co-PIs
Award Period: 7/1/2005-6/30/2008, Award Amount: $1,295,278
The proposal concerns the development of generic tools and technologies for creating and maintaining data cooperatives (confederations whose purpose is distributed data sharing) and their application within bioinformatics. Our co-PIs within biology who are using the prototyped tools are Chris Stoeckert (Genetics, U. Penn.) and Pete White (Children’s Hospital in Philadelphia). Technical results [13, 17, 18, 19, 20, 39, 71, 72, 73, 79, 101, 105] that are being used in the SHARQ project have been obtained in three primary areas: Models for Incomplete and Probabilistic Information, Provenance Management for Collaborative Data Sharing, and Query Interfaces.

Award ID 0612177IIS01-00681: SEI+II ProtocolDB: Archiving and Querying Scientific Protocols, Data and Provenance
Susan B. Davidson: PI; Collaborative project with Arizona State University, Zoe Lacroix, PI
Award Period: 8/1/2006-7/31/2009, Award Amount: $323,000
In the UPenn component of this collaborative project, we provided a model of provenance for scientific workflows which is general and sufficiently expressive to answer the provenance queries we encountered in a number of case studies [17, 39]. Based on this model, we developed techniques for focusing user attention on relevant portions of provenance information using “user views” [13, 14]. The current proposal builds on and extends this work to (1) consider a richer workflow model which allows hierarchy, recursion, alternation, and fine-grained dependencies; (2) develop a richer notion of views; (3) develop a connection between views of provenance and the need for access control and personalization; and (4) develop search and query languages which interact with multiple user views.

Award ID 0803524: III-COR-Medium: Providing Provenance through Workflows and Database Transformations
Susan B. Davidson: PI; Sanjeev Khanna and Val Tannen, co-PIs
Award Period: 08/01/2008-7/31/2012, Award Amount: $869,230
The objective of this proposal is to provide a framework for unifying workflow and database provenance, and to provide tools that allow a truly comprehensive approach to defining, manipulating, managing and querying the provenance of scientific data. The method is to use a data model that supports nested collections, and a functional language (the Nested Relational Calculus, NRC) to describe workflow specifications and database transformation over nested collections. Results have been obtained in modeling provenance [64, 69, 70], querying provenance [7, 8, 10, 15, 78, 82, 97, 98], articulating privacy issues in provenance [46, 47, 49, 50, 51], and most recently, integrating workflow and database-style provenance using Pig-Latin to elucidate the function of workflows whose modules have memory [3].

7

Broader Impact and Education

As mentioned in the introduction, the ideas of the proposal will have tremendous impact in curated, digital datasets, of which there are an increasing number within science and other application domains. We have consulted with a number of people from this community in writing this proposal, several of whom have written letters of support – Micah Altman (former Archival Director of Harvard’s qualitative data archive, and current Director of Research for the MIT Libraries), and John Kunze (Associate Director, UC Curation Center, California Digital Library) – and others whose datasets are now part of the proposed work (Tony Harmar, Professor of Pharmacology, head of IUPHAR; and James Frew, Director of the Environmental Information Library at UCSB). We also have close connections with many organizations for which data citation is a key issue, such as DataCite, SageCite, The Dataverse Network, SigDC, and the Digital Curation Centre (DCC). For example, a colleague of PI Frew at UCSB, Greg Janée, has extensive experience in digital curation [77, 75, 85, 76], is the developer of the EZID (see footnote 4) persistent identifier management service, and is currently piloting digital curation strategies for the UCSB Library; and co-PI Tannen

More specifically, the proposed work will impact:
1) Scientists who publish their findings in organized data collections or databases or scientists;
2) Data centers that are charged with the task of publishing and preserving data for an organization, discipline or other enterprise;
3) Businesses and government agencies that provide on-line reference works, such as encyclopedias, gazetteers, business reports, etc. as part of their business;
4) Standards organizations and the like that are trying to formulate principles for data citations; and
5) Any scientist or scholar who wants to – and should be able to – give attribution to data they find on the Web.

In these domains, there is a pressing demand by contributors to have the data they have produced properly cited, and there are numerous specifications for the structure and format of citations. However, to have this impact we will need to ensure that results of the project are informed by and made available to these communities. We will therefore work with the DCC and other organizations to disseminate results of this research; in particular, we will organize workshops to gain input from those organizations and other stakeholders. The budget therefore includes funds for a workshop; another may be funded by the Digital Curation Centre at the University of Edinburgh, which has expressed interest in doing so. The NSF-funded workshop during the first year will bring together at Penn the project personnel; stakeholders such as Harmar, Altman, and Kunze; other representatives from DataCite, SageCite, The Dataverse Network, SigDC, and the Digital Curation Centre (DCC); and other maintainers of digital datasets. At this 2-day workshop, we will gain input on what is broadly needed in citation, and what directions to include other than those in the proposal. We also plan to hold a second workshop, potentially funded by the Digital Curation Centre at the University of Edinburgh, during the last year of the proposal to disseminate ideas of the research.

Education. The PIs will also incorporate results of this work in graduate level database and bioinformatics courses that they are involved with at Penn. In particular, CIS 550 - Introduction to Database and Information Systems is taught alternately by Davidson and Tannen, and includes several sessions on advanced research topics. They also hope to offer an advanced topics course, CIS 650, on topics related to this proposal. The PIs will also pursue giving tutorials and workshops in the digital curation/digital libraries communities.

4 http://n2t.net/ezid/


Budget Justification

Project Personnel Support: The budget includes support for the PIs, Davidson, Tannen and Buneman, as well as support for one graduate student (RA) and one programmer at Penn. The expertise of the graduate student (to be hired) will have to include database theory, systems and linked data; it is also expected that he or she will participate in the implementation. However, since the implementation effort is expected to be significant, we are requesting additional programmer support. The effort for staff/programmers at IUPHAR-DB will be provided. This will include Joanna Sharman (Edinburgh University), who has intimate knowledge of IUPHAR-DB, very good programming experience, and general experience in bioinformatics. The effort for a graduate student at UCSB (who will also do implementation work for the ES3 datasets) is covered in the linked proposal from PI Frew at UCSB.

Travel Support: Any active database researcher should attend, and present at, at least two international conferences a year. These include ACM SIGMOD, ACM PODS, VLDB, ICDT, ICDE. In addition there are the standard digital libraries conferences (ECDL and JCDL) and digital curation conferences such as IDCC as well as a host of relevant workshops. Since this work brings together two communities, we believe it important to maintain a presence at both their meetings. We are therefore requesting 6 domestic trips for Davidson, Tannen and their RA per year, and 4 international trips per year; at least 2 of these trips per year will be used for trips between Edinburgh and Penn by Buneman and Fan.

Research Supplies: Since we will be demonstrating prototype systems at the venues mentioned above, three good laptops are requested as we shall want to demonstrate both the importing of citations and local processing of them.

Workshop Support: In the first year of the project, we will bring together at Penn collaborators from IUPHAR and ES3, as well as other members of the general citation community (working with DCC, with which Buneman is intimately involved as a founder). Another workshop is planned for the third year of the project, to disseminate results of this research. We are requesting funds to support the initial workshop; the other will be funded by DCC. Included in the requested $10,000 are standard costs for a workshop with 40-50 participants: room and AV rental; coffee and lunches; conference dinner; publication and web site; and travel for 3-4 invited speakers.

List of Project Personnel and Partner Institutions
1. Susan Davidson; University of Pennsylvania; PI
2. Val Tannen; University of Pennsylvania; co-PI
3. Peter Buneman; University of Edinburgh (Adjunct at University of Pennsylvania); co-PI
4. James Frew; University of California, Santa Barbara; co-PI
5. Wenfei Fan; University of Edinburgh; Senior Personnel

Data Management Plan

The project only indirectly creates content and data resources. More specifically:
• We will take an existing public data resource, IUPHAR-DB (http://www.iuphar-db.org), an open access database providing information on medicinal and experimental drugs and their targets in the body, and extend it to include citations as discussed in the proposal.
• We will extend two datasets associated with ES3 (the ocean datasets at http://wiki.icess.ucsb.edu/measures/index.php/GSM and the snow datasets at tp://ftp.eri.ucsb.edu/pub/org/eil/products/MODSCAG) to make them citable.
• We will publish a full grammar for the specification languages.
• The research also involves experiments to determine the impact of citations, and will measure the human effort in setting up the citation system using our test cases. The results of this research constitute data.

The above data have no private or confidential aspects. We will disseminate the grammar and experimental results in refereed scholarly publications, and will also associate the results with the software (see below for archiving plans).

Software. Our focus is on the development of software for data citation. We will make the tools we develop (the Preprocessing modules, the Generator, the Checkers, and the Specification languages) available as open source, distributed on Google Code under the Apache license. We will also explore the benefits of hosting our software on the popular GitHub site. The GitHub repository is a publicly hosted resource that will remain available in perpetuity. It is also institution-neutral, meaning the data will be available even if the PIs move on. Moreover, the Git version control system replicates the repository, meaning that anyone who checks out files will continue to have a local copy.
The tools and ideas described in this proposal will also be publicized to the scientific computing community through the associated workshops, one funded by this proposal in Year 1 and the other in Year 3 with proposed funding from the DCC (with which Buneman is closely involved). The tools will also be demonstrated at relevant conferences and workshops in the database community (e.g. ACM SIGMOD, VLDB, ICDT, ICDE), the digital libraries community (e.g. ECDL and JCDL) and the digital curation community (e.g. IDCC).

Collaboration Plan

The proposal involves Susan Davidson (PI) and Val Tannen (co-PI) at the University of Pennsylvania; Peter Buneman at Edinburgh University; and Jim Frew at UC Santa Barbara. In addition, we will collaborate closely with Wenfei Fan, who is listed as Senior Personnel (see attached letter of support). Wenfei Fan is Professor of Web Data Management; his research interests include data quality, distributed query processing, query languages, XML and Web services. His work has led to standards, patentable methods and working systems being used in industry.

The investigators have a strong record of collaboration, and the coordination mechanisms described below will ensure that it continues. In particular, Buneman, Davidson and Tannen collaborated on the K2/Kleisli data integration system [31, 43, 100] while Buneman was on the faculty at Penn; Davidson and Tannen have continued the collaboration on data integration (the SHARQ project [18, 72]), as well as a project to “marry” database-style and workflow provenance [3]. Fan has collaborated with Buneman and Davidson on constraints for XML [30, 23, 44]; Davidson and Fan also co-advised a student at Penn on this topic (Carmem Hara). Buneman and Fan are in the same research group and have collaborated extensively since moving to Edinburgh. They both pay frequent visits to Penn.

The specific roles of the PI and co-PIs follow their respective areas of expertise:
• Susan Davidson will be responsible for overall project management, including managing the workshops, visits, student exchanges, project meetings, and teleconferences. Her expertise related to this proposal includes modeling provenance and user views in workflow systems [3, 14, 21, 39, 47, 48], provenance and privacy [46, 49, 50] and provenance query optimization [9, 10]. She will continue her collaborations with Fan on constraints and updates, and work with Buneman and Tannen on rule-based path languages for data citation, including extensions to web data.
• Val Tannen will be responsible for developing the proposed citation language and its implementation, and will collaborate with Buneman and Davidson on these ideas. His expertise related to this proposal includes nested collections query languages [99], foundations of data sharing [71, 72, 73, 79, 101], and database provenance [3, 5, 64, 71].
• Peter Buneman will be responsible for managing the deployment of ideas within IUPHAR. He will also collaborate with Davidson, Tannen and Fan on languages and constraints. His expertise related to this proposal includes database semantics, approximate information, query languages, types for databases, data integration, bioinformatics and semistructured data [90, 22, 36, 1, 33]. He has also worked on issues associated with scientific databases such as data provenance, archiving and annotation [35, 34, 26, 29, 28, 81].
• Wenfei Fan will be responsible for developing the research related to constraints and updates, and has a student dedicated to the project (Yang Cao), as indicated in his letter of support. He will continue collaborations with Davidson and Buneman on these ideas. His expertise related to this proposal includes pioneering work on constraint checking in XML [63, 62] and on path constraints [32]. He has also investigated incremental versions of various database tasks, e.g., inconsistency detection [61], and is well equipped to deal with similar incremental problems in constraint checking. Most importantly, we will make use of a variant of his PRATA system [12, 16], which we will need in creating database wrappers that provide a hierarchical interface.
• Jim Frew will be responsible for testing the ideas of this proposal in the Earth System Science Server (ES3) project. His expertise related to this proposal includes the development of two provenance capture systems [65, 67] and their integration into Earth science computing environments [57, 58]. He is currently exploring the use of provenance to inform publication decisions [66].

Specific coordination mechanisms: The PIs have a history of successful collaboration. It is not difficult for the PI and co-PI at Penn to coordinate since they are at the same institution with nearby offices; furthermore, they hold weekly Database Research Group meetings (see http://db.cis.upenn.edu/). It is also not difficult for Buneman and Fan to coordinate since they are at the same institution. Specific mechanisms for managing the distance collaborations will include:

1. Conference calls. Davidson, Buneman, Tannen and their students will hold bi-weekly Skype meetings to discuss progress on ideas related to this proposal. We will include Frew and Harmar (head of the IUPHAR-DB, see letter of support) once a month to get input on technical directions.
2. Students and Postdocs. We have found that visits by students and postdocs are an excellent way of sharing ideas. We will arrange for students in the UK and the supported student at UCSB to make short-term visits to Penn to exchange ideas.
3. PI Visits. Besides students and postdocs, we plan to host Buneman and Fan for short visits. Fan currently comes back to Philadelphia every summer, and Buneman makes frequent visits in the fall.
4. Workshop. One workshop during the first year will be supported by this project. In addition to the project personnel, stakeholders such as Harmar, Altman (former Archival Director of Harvard’s qualitative data archive, and current Director of Research for the MIT Libraries; see attached letter of support), Kunze (Associate Director, UC Curation Center, California Digital Library; see attached letter of support) and others will be invited to Penn for two days to give input on what is broadly needed in citation, and on what directions to include other than those in the proposal. We also plan to hold a second workshop, potentially funded by the Digital Curation Centre at the University of Edinburgh, during the last year of the proposal to disseminate the ideas of the research.
5. Project website and Wiki page. We have found that for large, distributed projects a Wiki page is extremely effective (see http://phylodata.seas.upenn.edu/cgi-bin/wiki/pmwiki.php for the pPOD project Wiki page). This is an excellent mechanism for disseminating discussions that have taken place between subgroups of participants, as well as for posting papers. We will also set one up for this project.
6. SVN repository. An SVN repository has been set up and is maintained at Penn for writing papers (and this proposal).

The budget includes travel money to support two trips per year for visits to Penn by Buneman and Fan, as well as travel for reverse visits by PIs, student visits, and trips to conferences at which our results will be presented.

References
[1] Serge Abiteboul, Peter Buneman, and Dan Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999.
[2] Serge Abiteboul and Richard Hull. IFO: a Formal Semantic Database Model. ACM Trans. Database Syst., 12:525–565, Nov 1987.
[3] Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, Julia Stoyanovich, and Val Tannen. Putting Lipstick on Pig: Enabling Database-style Workflow Provenance. PVLDB, 5(4):346–357, 2011.
[4] Yael Amsterdamer, Daniel Deutch, Tova Milo, and Val Tannen. On provenance minimization. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 141–152, 2011.
[5] Yael Amsterdamer, Daniel Deutch, and Val Tannen. On the limitations of provenance for queries with difference. CoRR, abs/1105.2255, 2011.
[6] Marcelo Arenas, Wenfei Fan, and Leonid Libkin. On verifying consistency of XML specifications. In PODS, pages 259–270, 2002.
[7] Zhuowei Bao, Sarah Cohen Boulakia, Susan B. Davidson, Anat Eyal, and Sanjeev Khanna. Differencing provenance in scientific workflows. In ICDE, pages 808–819, 2009.
[8] Zhuowei Bao, Sarah Cohen Boulakia, Susan B. Davidson, and Pierrick Girard. PDiffView: Viewing the difference in provenance of workflow results. PVLDB, 2(2):1638–1641, 2009.
[9] Zhuowei Bao, Susan B. Davidson, Sanjeev Khanna, and Sudeepa Roy. An optimal labeling scheme for workflow provenance using skeleton labels. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 711–722, 2010.
[10] Zhuowei Bao, Susan B. Davidson, and Tova Milo. Labeling recursive workflow executions on-the-fly. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 493–504, 2011.
[11] Michael Benedikt, Chee Yong Chan, Wenfei Fan, et al. DTD-directed publishing with attribute translation grammars. In VLDB, 2002.
[12] Michael Benedikt, Chee Yong Chan, Wenfei Fan, et al. Capturing both types and constraints in data integration. In ACM SIGMOD, 2003.
[13] Olivier Biton, Sarah Cohen Boulakia, and Susan B. Davidson. Zoom*UserViews: Querying relevant provenance in workflow systems. In VLDB, pages 1366–1369, 2007.
[14] Olivier Biton, Sarah Cohen Boulakia, Susan B. Davidson, and Carmem S. Hara. Querying and Managing Provenance through User Views in Scientific Workflows. In ICDE, pages 1072–1081, 2008.
[15] Olivier Biton, Susan B. Davidson, Sanjeev Khanna, and Sudeepa Roy. Optimizing user views for workflows. In ICDT ’09: Proceedings of the 12th International Conference on Database Theory, pages 310–323, 2009.
[16] Philip Bohannon, Byron Choi, and Wenfei Fan. Incremental evaluation of schema-directed XML publishing. In ACM SIGMOD, 2004.
[17] Sarah Cohen Boulakia, Olivier Biton, Shirley Cohen, and Susan B. Davidson. Addressing the provenance challenge using ZOOM. Concurrency and Computation: Practice and Experience, 20(5):497–506, 2008.

[18] Sarah Cohen Boulakia, Olivier Biton, Shirley Cohen, Zachary Ives, Val Tannen, and Susan Davidson. SHARQ Guide: Finding relevant biological data and queries in a peer data management system. In International Workshop on Data Integration in the Life Sciences (DILS), Poster proceedings, 2006.
[19] Sarah Cohen Boulakia, Olivier Biton, Susan B. Davidson, and Christine Froidevaux. BioGuideSRS: querying multiple sources with a user-centric perspective. Bioinformatics, 23(10):1301–1303, 2007.
[20] Sarah Cohen Boulakia, Susan B. Davidson, Christine Froidevaux, Zoé Lacroix, and Maria-Esther Vidal. Path-based systems to guide scientists in the maze of biological data sources. J. Bioinformatics and Computational Biology, 4(5):1069–1096, 2006.
[21] Shawn Bowers, Timothy M. McPhillips, Bertram Ludäscher, Shirley Cohen, and Susan B. Davidson. A model for user-oriented data provenance in pipelined scientific workflows. In IPAW, volume 4145 of LNCS, pages 133–147. Springer, 2006.
[22] Val Breazu-Tannen, Peter Buneman, Shamim Naqvi, and Limsoon Wong. Principles of Programming with Collection Types. Theoretical Computer Science, 149:3–48, 1995.
[23] P. Buneman, S. Davidson, W. Fan, C. Hara, and W.C. Tan. Reasoning about keys for XML. In Proceedings of the Workshop on Database Programming Languages, pages 133–148, 2001.
[24] Peter Buneman. wget -qO - http://mirror.hmc.edu/ctan/FILES.byname | grep ".bst$" | sed 's/.*\/\(.*\)/\1/' | sort -u | wc -l Executed on 18 November 2011.

[25] Peter Buneman. How to cite curated databases and how to make them citable. In SSDBM, pages 195–203, 2006.
[26] Peter Buneman, Adriane Chapman, and James Cheney. Provenance management in curated databases. In SIGMOD Conference, pages 539–550, 2006.
[27] Peter Buneman, James Cheney, Wang Chiew Tan, and Stijn Vansummeren. Curated databases. In ACM PODS, pages 1–12, 2008.
[28] Peter Buneman, James Cheney, Wang Chiew Tan, and Stijn Vansummeren. Curated databases. In PODS, pages 1–12, 2008.
[29] Peter Buneman, James Cheney, and Stijn Vansummeren. On the expressiveness of implicit provenance in query and update languages. ACM Trans. Database Syst., 33(4), 2008.
[30] Peter Buneman, Susan Davidson, Wenfei Fan, Carmem Hara, and Wang-Chiew Tan. Keys for XML. In WWW10, pages 201–210, 2001.
[31] Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. A data transformation system for biological data sources. In Proceedings of VLDB, pages 158–169, Sept 1995.
[32] Peter Buneman, Wenfei Fan, and Scott Weinstein. Interaction between path and type constraints. ACM Trans. Comput. Log., 4(4):530–577, 2003.
[33] Peter Buneman, Mary Fernandez, and Dan Suciu. UnQL: A Query Language and Algebra for Semistructured Data Based on Structural Recursion. VLDB Journal, 9(1):75–110, 2000.
[34] Peter Buneman, Sanjeev Khanna, Keishi Tajima, and Wang-Chiew Tan. Archiving Scientific Data. ACM Trans. Database Syst., 29:2–42, 2004.
[35] Peter Buneman, Sanjeev Khanna, and Wang Chiew Tan. Why and where: A characterization of data provenance. In ICDT, pages 316–330, 2001.
[36] Peter Buneman and Atsushi Ohori. Polymorphism and Type Inference in Database Programming. ACM Transactions on Database Systems, 21(1):30–76, March 1996.
[37] Peter Buneman and Gianmaria Silvello. A Rule-Based Citation System for Structured and Evolving Datasets. IEEE Data Eng. Bull., 33(3):33–41, 2010.

[38] Byron Choi, Xibei Jia, et al. A uniform system for publishing and maintaining XML data. In Proc. Intl. Conf. on Very Large Data Bases, pages 1301–1304, 2004. Demo.
[39] Shirley Cohen, Sarah Cohen Boulakia, and Susan B. Davidson. Towards a model of provenance and user views in scientific workflows. In DILS, pages 264–279, 2006.
[40] P. Cornillon, J. Gallagher, and T. Sgouros. OPeNDAP: Accessing data in a distributed, heterogeneous environment. Data Science Journal, 2:164–174, 2003.
[41] Datacite. http://datacite.org/. Visited Nov 2011.
[42] The Dataverse Network. http://thedata.org/. Visited Nov 2011.
[43] Susan B. Davidson, Jonathan Crabtree, Brian P. Brunk, Jonathan Schug, Val Tannen, G. Christian Overton, and Christian J. Stoeckert Jr. K2/Kleisli and GUS: Experiments in integrated access to genomic data sources. IBM Systems Journal, 40(2):512–531, 2001.
[44] Susan B. Davidson, Wenfei Fan, Carmem S. Hara, and Jing Qin. Propagating XML constraints to relations. In ICDE, pages 543–554, 2003.
[45] Susan B. Davidson and Juliana Freire. Provenance and scientific workflows: challenges and opportunities. In SIGMOD Conference, pages 1345–1350, 2008.
[46] Susan B. Davidson, Sanjeev Khanna, Tova Milo, Debmalya Panigrahi, and Sudeepa Roy. Provenance views for module privacy. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 175–186, 2011.
[47] Susan B. Davidson, Sanjeev Khanna, Debmalya Panigrahi, and Sudeepa Roy. Preserving module privacy in workflow provenance. Manuscript available at http://arxiv.org/abs/1005.5543, 2010.
[48] Susan B. Davidson, Sanjeev Khanna, Sudeepa Roy, and Sarah Cohen-Boulakia. Privacy issues in scientific workflow provenance. In Proceedings of the 1st International Workshop on Workflow Approaches for New Data-Centric Science, June 2010.
[49] Susan B. Davidson, Sanjeev Khanna, Sudeepa Roy, Julia Stoyanovich, Val Tannen, and Yi Chen. On provenance and privacy. In ICDT, pages 3–10, 2011.
[50] Susan B. Davidson, Sanjeev Khanna, Val Tannen, Sudeepa Roy, Yi Chen, Tova Milo, and Julia Stoyanovich. Enabling privacy in provenance-aware workflow systems. In CIDR, pages 215–218. www.cidrdb.org, 2011.
[51] Susan B. Davidson, Soohyun Lee, and Julia Stoyanovich. Keyword search in workflow repositories with access control. In Pablo Barceló and Val Tannen, editors, AMW, volume 749 of CEUR Workshop Proceedings. CEUR-WS.org, 2011.
[52] The Digital Curation Centre. http://www.dcc.ac.uk/. Visited Nov 2011.
[53] Alin Deutsch, Lucian Popa, and Val Tannen. Physical Data Independence, Constraints and Optimization with Universal Plans. In International Conference on Very Large Databases (VLDB), September 1999.
[54] Alin Deutsch and Val Tannen. Containment and integrity constraints for XPath. In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases (KRDB), 2001.
[55] Alin Deutsch and Val Tannen. MARS: A System for Publishing XML from Mixed and Redundant Storage. In VLDB, pages 201–212, 2003.
[56] Alin Deutsch and Val Tannen. Reformulation of XML queries and constraints. In ICDT, 2003.
[57] J. Dozier and J. Frew. Computational provenance in hydrologic science: A snow mapping example. Phil. Trans. R. Soc. A, 367(1890):1021–1033, 2008.

[58] J. Dozier, T.H. Painter, K. Rittger, and J. Frew. Time–space continuity of daily maps of fractional snow cover and albedo from MODIS. Advances in Water Resources, 31(11):1515–1526, 2008.
[59] Dublin Core Metadata Element Set, Version 1.1. http://dublincore.org/documents/dces/. Visited Nov 2011.
[60] Albert Einstein. Ist die Trägheit eines Körpers von seinem Energieinhalt abhängig? Annalen der Physik, 18(13):639–641, 1905.

[61] Wenfei Fan, Jianzhong Li, Nan Tang, and Wenyuan Yu. Incremental detection of inconsistencies in distributed data. In ICDE, pages 318–329, 2012.
[62] Wenfei Fan and Leonid Libkin. On XML integrity constraints in the presence of DTDs. JACM, 49(3):368–406, May 2002.
[63] Wenfei Fan and Jérôme Siméon. Integrity constraints for XML. In ACM Principles of Database Systems, pages 23–34, 2000.
[64] J. Nathan Foster, Todd J. Green, and Val Tannen. Annotated XML: queries and provenance. In PODS, pages 271–280, 2008.
[65] J. Frew and R. Bose. Earth system science workbench: A data management infrastructure for earth science products. In L. Kerschberg and M. Kafatos, editors, SSDBM 2001 Thirteenth International Conference on Scientific and Statistical Database Management, pages 180–189. IEEE Computer Society, 2001.
[66] J. Frew, G. Janée, and P. Slaughter. Provenance-enabled automatic data publishing. In Scientific and Statistical Database Management, volume 6809 of Lecture Notes in Computer Science, pages 244–252. Springer, 2011.
[67] J. Frew, D. Metzger, and P. Slaughter. Automatic capture and reconstruction of computational provenance. Concurrency and Computation: Practice and Experience, 20(5):485–496, 2008.
[68] Belinda Giardine et al. Systematic documentation and analysis of human genetic variation in hemoglobinopathies using the microattribution approach. Nature Genetics, 43:295–301, 2011.
[69] Todd J. Green. Containment of conjunctive queries on annotated relations. In ICDT, Saint Petersburg, Russia, March 2009. Best Student Paper Award.
[70] Todd J. Green, Zachary G. Ives, and Val Tannen. Reconcilable differences. In ICDT, pages 212–224, 2009.
[71] Todd J. Green, Gregory Karvounarakis, and Val Tannen. Provenance semirings. In PODS, pages 31–40, 2007.
[72] Todd J. Green, Gregory Karvounarakis, Nicholas E. Taylor, Olivier Biton, Zachary G. Ives, and Val Tannen. Orchestra: facilitating collaborative data sharing. In SIGMOD Conference, pages 1131–1133, 2007.
[73] Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Update exchange with mappings and provenance. In VLDB, pages 675–686, 2007.
[74] International Association for Social Science Information Services & Technology. http://www.iassistdata.org/. Visited Nov 2011.

[75] G. Janée. Preserving geospatial data: The national geospatial digital archive’s approach. In Archiving 2009, pages 25–29. Society for Imaging Science and Technology, 2009.
[76] G. Janée, J. Frew, and T. Moore. Relay-supporting archives: Requirements and progress. International Journal of Digital Curation, 4(1):57–70, 2009.

[77] G. Janée, J. Mathena, and J. Frew. A data model and architecture for long-term preservation. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 134–144. ACM, 2008.
[78] Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Querying data provenance. In SIGMOD Conference, pages 951–962, 2010.
[79] Ioanna Koffina, Giorgos Serfiotis, Vassilis Christophides, and Val Tannen. Mediating RDF/S queries to relational and XML sources. Int. J. Semantic Web Inf. Syst., 2(4):68–91, 2006.
[80] Ioanna Koffina, Giorgos Serfiotis, Vassilis Christophides, Val Tannen, and Alin Deutsch. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). In Semantic Interoperability and Integration, 2005.
[81] Egor V. Kostylev and Peter Buneman. Combining dependent annotations for relational algebra. In ICDT, pages 196–207, 2012.
[82] Ziyang Liu, Susan B. Davidson, and Yi Chen. Generating sound workflow views for correct provenance analysis. ACM Trans. Database Syst., 36(1):6, 2011.
[83] Aimilia Magkanaraki, Val Tannen, Vassilis Christophides, and Dimitris Plexousakis. Viewing the semantic web through RVL lenses. J. Web Sem., 1(4):359–375, 2004.
[84] S. Maritorena, O. Hembise Fanton d’Andon, A. Mangin, and D. Siegel. Merged satellite ocean color data products using a bio-optical model: Characteristics, benefits and issues. Remote Sensing of Environment, 114(8):1791–1804, 2010.
[85] G. McGarva, S. Morris, and G. Janée. Preserving geospatial data. Technology Watch Series 09-01, Digital Preservation Coalition, 2009.
[86] Tova Milo, Dan Suciu, and Victor Vianu. Typechecking for XML transformers. J. Comput. Syst. Sci., 66(1):66–97, 2003.
[87] Heiko Müller. XARCH – the XML Archiver. http://xarch.sourceforge.net/. Visited Nov 2011.
[88] Open Knowledge Foundation. http://okfn.org/. Visited Nov 2011.
[89] T.H. Painter, K. Rittger, C. McKenzie, P. Slaughter, R.E. Davis, and J. Dozier. Retrieval of subpixel snow covered area, grain size, and albedo from MODIS. Remote Sensing of Environment, 113(4):868–879, 2009.
[90] Peter Buneman, Susan Davidson, Kyle Hart, Chris Overton, and L. Wong. A Data Transformation System for Biological Data Sources. In Proceedings of VLDB, Sep 1995.
[91] Piotr Nowakowski et al. The Collage Authoring Environment. In Proceedings of the International Conference on Computational Science, volume 4, pages 608–617, 2011.
[92] Lucian Popa, Alin Deutsch, Arnaud Sahuguet, and Val Tannen. A Chase Too Far? In Proceedings of ACM SIGMOD International Conference on Management of Data, May 2000.
[93] G. Ramalingam and Thomas Reps. On the computational complexity of dynamic graph problems. TCS, 158(1-2), 1996.
[94] SageCite. http://www.ukoln.ac.uk/projects/sagecite/. Visited Nov 2011.
[95] Giorgos Serfiotis, Ioanna Koffina, Vassilis Christophides, and Val Tannen. Containment and minimization of RDF/S query patterns. In International Semantic Web Conference, pages 607–623, 2005.
[96] Joan Starr et al. DataCite Metadata Scheme for the Publication and Citation of Research Data, January 2011. http://datacite.org/schema/DataCite-MetadataKernel_v2.0.pdf, visited Oct 2011.

[97] Julia Stoyanovich, Ben Taskar, and Susan Davidson. Exploring repositories of scientific workflows. In Proceedings of WANDS, 2010.
[98] Peng Sun, Ziyang Liu, Susan B. Davidson, Siva N., and Yi Chen. WOLVES: Achieving Correct Provenance Analysis by Detecting and Resolving Unsound Workflow Views. In PVLDB, 2009.
[99] Val Tannen, Peter Buneman, and Limsoon Wong. Naturally embedded query languages. In ICDT, pages 140–154, 1992.
[100] Val Tannen, Susan B. Davidson, and Scott Harker. The information integration system K2. In Bioinformatics: Managing Scientific Data. Elsevier, 2003.
[101] Nicholas E. Taylor and Zachary G. Ives. Reconciling while tolerating disagreement in collaborative data sharing. In SIGMOD Conference, pages 13–24, 2006.
[102] Named graphs. http://www.w3.org/2004/03/trix/. Referenced September 2012.
[103] J.D. Watson and F.H.C. Crick. A Structure for Deoxyribose Nucleic Acid. Nature, 171, 1953.
[104] Wikipedia/Citation. http://en.wikipedia.org/wiki/Citation. Visited Nov 2011. See “Style Guides” insert.
[105] Yifeng Zheng, Stephen Fisher, Shirley Cohen, Sheng Guo, Junhyong Kim, and Susan B. Davidson. Crimson: A data management system to support evaluating phylogenetic tree reconstruction algorithms. In VLDB, pages 1231–1234, 2006.
