Submitted by Peter.Wittenburg on Wed, 09/18/2013 - 14:14

Peter Wittenburg, Daan Broeder, Max Planck Institute for Psycholinguistics
 
Cluster project meeting
Delegates from all the EC-funded cluster projects met recently to discuss similarities and differences in relation to the major problems arising in the various projects. These cluster projects cover a range of areas: CRISP – physics, BIOmedBridges – life sciences, ENVRI – environmental sciences, and DASISH – social sciences and humanities. Together these clusters represent quite a number of ESFRI research infrastructure initiatives. The discussions at the meeting revealed that there are large differences, not only within each cluster project but also within each research infrastructure, in regard to data organization and computational aspects. This means that – no matter what the members of the cluster projects agree upon as major issues to be tackled by the interaction of the projects – we must take these differences in the state of IT usage into account in our solution process. It is going to require a lot of effort – at European, national and also community levels – to equalize the state of IT usage across the cluster communities, and doing this is probably the largest challenge we face! However, irrespective of the outcome of individual cluster projects, the meeting highlighted a substantial benefit of the interactions between the clusters: the cross-fertilization between research communities (and resulting increase in awareness about advanced methods and technologies) has been an unexpected bonus.
 
Results of the meeting
During the meeting, we identified four different areas where major work needs to be undertaken:

  1. defining mechanisms to identify data, software and users,
  2. implementing standards for data formats, and establishing facilities for data and services discovery and access,
  3. establishing safe storage facilities (that also provide for handling incomplete data), and
  4. initiating community engagement that provides bridging between communities and semantic annotation facilities.

Delegates from the meeting are now working on specifying details relating to these different challenges, as an initial step in the process of indicating common solutions.
 
Instead of elaborating on all four of these items at this time, the rest of this article will focus on the first item. This item relates to the challenges associated with identifying and tracking data and software objects (along with the concepts used to describe these objects) and also the users producing and accessing the objects. Issues such as versioning and data collections, and workflows and orchestrated services, make this a complicated challenge. The challenges relating to identification have emerged in an era where ever-growing quantities of data are being stored, exchanged, re-used and enriched by an increasing number of software components – applied not only by researchers from various disciplines but also by citizens operating in a largely anonymous fashion. Consequently, the handling of identities plays a crucial role in data management nowadays. We must establish suitable methods for maintaining stable and proven identities so that we will be able to reliably track the following:
 

  • whether a certain data object is still the one we want to access,
  • what happened to a data object during its lifetime, and what kinds of derived objects were produced by which software components,
  • how software components have changed over time,
  • how the semantics of the terms used in data and metadata have changed over time, and
  • which users re-used objects, created new derived objects and so forth.
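
The first question in the list above, whether an object is still the one we expect, is often answered in practice by keeping a checksum next to the identifier. The following Python sketch illustrates that idea only; the article does not prescribe a mechanism, and the names IdentityRecord, fingerprint and is_same_object, as well as the identifier string, are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentityRecord:
    """Hypothetical record: an identifier plus the checksum taken at registration."""
    identifier: str
    sha256: str


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of the object's bytes."""
    return hashlib.sha256(content).hexdigest()


def is_same_object(record: IdentityRecord, content: bytes) -> bool:
    """True if the bytes we retrieved match the checksum stored at registration."""
    return fingerprint(content) == record.sha256


# Register an object, then check a later retrieval against the stored checksum.
original = b"raw measurement"
record = IdentityRecord(identifier="example:obj-1", sha256=fingerprint(original))
unchanged = is_same_object(record, b"raw measurement")
modified = is_same_object(record, b"raw measurement, edited")
```

A checksum only covers bit-level identity; the other questions in the list (derivation, software versions, changing semantics, users) need additional attributes stored with the identifier.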

 
To address these issues relating to establishing and tracking identities, we urgently need to move towards solutions which incorporate the following points. 

  • We need a world-wide system that allows every data repository contributing to an open domain of long-term accessible data to register and resolve Persistent Identifiers (PIDs) for all the data objects being created as a result of scientific workflows, and to store attributes with the PIDs. The urgency of this topic has already been addressed at the ICRI conference by a manifesto signed by a large number of initiatives (http://dasish.eu/manifesto).
  • More and more data objects are being created by software components that run as part of scientific workflows, and these components are updated regularly due to algorithmic improvements and/or technological innovation. In these situations we need a method of identifying specific software objects and workflows with the help of Persistent Identifiers which point to stored software versions and are associated with useful attributes.
  • In almost all scientific disciplines, we see that the future data fabric (determined by automatic procedures that implement policies) contains a continuum of objects from raw data up to publications (all of which are part of a domain of referable or citable data objects). Smart algorithms would allow tracing back data flows, source-sink relationships, versioning and so forth by making use of the PID and/or metadata information.
  • Often data streams (of data and/or metadata) not only encode numbers, but also terms bearing some semantics, and these are embedded in schemas that define the semantic context. To be able to interpret such data, one needs to be able to interpret the meaning of the terms that are used. Thus we urgently need to establish a domain of open concept registries that are used and managed by scientific communities and that allow scientists to easily register and define their concepts.
  • Data accessibility via the internet fosters re-usage and enrichment of existing data, as well as the creation of new data, in the sense of the data fabric mentioned above, both by researchers and by citizens. It is obvious that in a domain of trusted science, we need not only to be able to prove the identity of data and software objects and track data flows, but we also need to know (in a verifiable way) the roles of the specific people involved in the data continuum, for example, who created the objects. Thus we need to move towards a unified worldwide system for the registration of identities of the actors involved.
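
To make the points above more concrete, here is a minimal in-memory sketch in Python of a registry that registers and resolves PIDs, stores attributes with them, and traces a derived object back to its sources. It is an illustration under stated assumptions, not a description of any existing PID service: the class names PidRecord and PidRegistry, the "example" prefix, the field names and the example.org locations are all hypothetical.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PidRecord:
    """Hypothetical PID record: where the object lives plus descriptive attributes."""
    pid: str
    location: str
    kind: str  # "data", "software" or "actor" in this sketch
    attributes: Dict[str, str] = field(default_factory=dict)
    derived_from: List[str] = field(default_factory=list)  # PIDs of source objects
    produced_by: Optional[str] = None  # PID of the software version used
    created_by: Optional[str] = None   # PID of the responsible actor


class PidRegistry:
    """Toy registry; a real service would be persistent and distributed."""

    def __init__(self, prefix: str = "example") -> None:
        self._prefix = prefix
        self._records: Dict[str, PidRecord] = {}
        self._counter = itertools.count(1)

    def register(self, location: str, kind: str,
                 attributes: Optional[Dict[str, str]] = None,
                 derived_from: Optional[List[str]] = None,
                 produced_by: Optional[str] = None,
                 created_by: Optional[str] = None) -> str:
        """Mint a new PID, store its record and return the PID."""
        pid = f"{self._prefix}/{next(self._counter):06d}"
        self._records[pid] = PidRecord(
            pid=pid, location=location, kind=kind,
            attributes=dict(attributes or {}),
            derived_from=list(derived_from or []),
            produced_by=produced_by, created_by=created_by)
        return pid

    def resolve(self, pid: str) -> PidRecord:
        """Return the record for a PID; raises KeyError if it is unknown."""
        return self._records[pid]

    def lineage(self, pid: str) -> List[str]:
        """Trace back all source objects of a PID, nearest sources first."""
        seen: List[str] = []
        queue = list(self.resolve(pid).derived_from)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            queue.extend(self.resolve(current).derived_from)
        return seen


# Usage: an actor, a software version, and a chain raw data -> cleaned data -> paper.
registry = PidRegistry()
author = registry.register("https://example.org/people/1", "actor")
tool = registry.register("https://example.org/software/filter", "software",
                         attributes={"version": "1.0"})
raw = registry.register("https://example.org/data/raw", "data", created_by=author)
clean = registry.register("https://example.org/data/clean", "data",
                          derived_from=[raw], produced_by=tool, created_by=author)
paper = registry.register("https://example.org/publications/1", "data",
                          derived_from=[clean], created_by=author)
sources = registry.lineage(paper)  # [clean, raw]
```

In this sketch, data objects, software versions and actors share one identifier space, so a single lookup chain answers which software produced an object and who created it. Whether these should live in one system or in several linked ones is a design choice the text above leaves open.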

 
It was interesting to see that, despite having different disciplinary backgrounds (from physics to humanities), the delegates agreed on almost all the items mentioned. As might be expected, concept identity and semantic bridging do not have a high priority for CRISP, and similarly CRISP and BIOmedBridges do not see the urgency of volatile data management.


Next steps
The delegates from the meeting are in the process of creating a document detailing all four of the areas where we need to undertake further work. Once that document is completed, we will send it to the different cluster projects for internal discussion, and this discussion will serve as the basis for producing a joint position statement.
 
For EUDAT, it is interesting to note the outcomes from the cluster project meeting as the EUDAT project is already dealing with some of these challenges (for example, in the task forces working on service delivery), and EUDAT – like the cluster projects – has decided to have a domain of registered data with metadata descriptions and PIDs assigned to all objects.