Putting research communities in the driving seat of e-infrastructures

Background

The European Commission has established various pan-European initiatives to develop e-infrastructures to support the work of European research communities. It has been a challenge for these initiatives to find effective ways to engage with the communities in the process of developing the infrastructure services. Too often this engagement has been reduced to the existence of a forum where communities are supposed to give advice on the services being developed, which has often had little practical effect on the development of the actual services. On other occasions, e-infrastructures have been dominated by a limited number of stakeholders – which has sometimes led to over-representation from a single community or institute. Consequently, although the services that were developed fitted the needs of that particular community or institute, it was subsequently hard to extend those services so as to be of benefit to other research communities with different requirements.

Agile user community involvement in common data service development – from selection to delivery – is both possible and necessary, and can also be very fruitful. However, community involvement is time-consuming and demands a great deal of commitment and communication, which restricts the type of experts who can participate.

There are a few examples where communities and technology providers have been able to effectively co-design services. In the USA, initiatives such as DataONE have earned a reputation for being genuinely community-driven infrastructures – this has been achieved by having non-IT people in the leading roles and by organizing a broad range of forums to engage various types of domain experts. In Europe, the ESFRI initiatives are also taking this path, although it is too early as yet to draw solid conclusions from the dynamics being created in the design of the infrastructures. What is clear is that the development of each of these research infrastructures is being driven by requirements that originate from a specific discipline. Thus it is likely that the infrastructures that are built will indeed match these requirements. However, there are requirements that go beyond the boundaries of individual research disciplines and hence these requirements can be assumed to be shared more broadly. The ESFRI cluster projects are trying to foster the development of cross-disciplinary practices and requirements within broader scientific domains. The EUDAT project is going a step further by exploring ways to build generic technical services that can support multiple research communities. EUDAT is working with a wide range of communities to deliver these technical services as part of the EUDAT Collaborative Data Infrastructure (CDI). To be successful in this ambitious initiative, EUDAT is using novel methods to involve all the stakeholders, both in the discussions to determine the required services, and in the process of designing, developing and implementing those services.

Building a CDI driven by research communities: an ambitious but promising activity

EUDAT is a pan-European initiative that started in October 2011. It brings together 25 partners, including research communities, national data and high performance computing (HPC) centres, technology providers, and funding agencies from 13 countries. The EUDAT project is building a sustainable cross-disciplinary and cross-national data infrastructure providing a set of shared services for accessing and preserving research data. The project began by reviewing the approaches and requirements of the research communities that were initially involved with the project – these included communities from linguistics (CLARIN), solid earth sciences (EPOS), climate sciences (ENES), environmental sciences (LIFEWATCH), and biological and medical sciences (VPH). As a result of this, four services of common interest were shortlisted to be deployed as the initial shared services on the EUDAT infrastructure. These services are data replication from site to site, data staging to compute facilities, a metadata catalogue, and easily shareable storage.

EUDAT has been working on the principle that the research communities (which were represented in the discussions by some strong research centres) should be in the driving seat from the very beginning when selecting the main services and their functional requirements. Services are sometimes built by starting with a first phase of requirements gathering that involves potential users, and then moving on to develop those services without the involvement of the same potential users. In contrast, researchers and community managers have contributed to the full EUDAT process, directly participating in the design and development of the services through multi-disciplinary task forces. EUDAT has also successfully established a non-linear and flexible discussion process which allows suggestions for new services to be made at any time, thanks to the frequent interactions between EUDAT’s stakeholders.

Building a sustained dialogue between individuals who come from such diverse organizational, cultural and disciplinary backgrounds has been a challenge in itself. Everyone involved in the discussions needed effective listening skills so as to hear and understand each other’s requirements, and also needed the flexibility to adapt to different cultures and working practices. After several months of intensive interactions, during which we explored the various ways that the different research communities were organizing their data, we established common ground and terminology for the required data services. In the course of this process, we also found creative ways to bridge the gaps that sometimes arose between the services that were wanted and the services that it would be feasible for EUDAT to develop.

It was largely left to the various research centres that were represented in these discussions to choose the most appropriate ways to interact with their broader research communities in relation to the goals that were determined for the EUDAT services. However, EUDAT experts were also involved in various activities to foster a wider interaction between the research communities and the project. For example, EUDAT researchers participated in domain meetings, organized user forums involving more community experts, and organized workshops and training events. This type of multi-level interaction process takes a large amount of time, and consequently there are limits to what each individual researcher can contribute, particularly as every core researcher is under enormous pressure to publish and must therefore rely on mediating community experts being active instead.

Designing cross-disciplinary services

While the various EUDAT task forces (which brought together community experts, service provider experts and technologists) were working on developing the initial data services, the close interaction enabled us to realise that we had overestimated the degree of organization of the data within some of the research communities, including the larger ones. Also, in the on-going development process, we became aware that there is heterogeneity within research communities in regard to data organization principles, the technologies being used, and the level of awareness about technologies.

It is hard to engage leading researchers from the various communities in this time-costly interaction process (due to other demands on their time). Consequently it is necessary to rely instead on mediating experts who have a deep understanding of the data services needed within particular research domains.

A concrete example of this arose in the process of designing the “safe replication” service. Initially we assumed that the ESFRI communities would all have proper data organization at a logical level in place (with routinely assigned metadata and persistent identifiers for digital objects and collections, and so forth). However, it became obvious that many communities, and centres within communities, were in the midst of working on establishing such proper data organization, and consequently many community centres and smaller departments were not in a position to take up the safe replication service in its full version at the time. In a joint agile discussion process – which was non-linear due to the different opinions involved – we eventually agreed on implementing four “flavours” of replication, while maintaining the notion of “EUDAT’s data domain as a domain of registered and described data”: full replication (iRODS to be installed), light replication (GridFTP to be installed), custom replication (adaptation to Fedora, DSpace, MediaWiki, etc.), and simple store service (addresses projects and individuals).
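
The four flavours can be pictured as a small decision table. The Python sketch below is purely illustrative and is not EUDAT code: the `Centre` fields, the `suggest_flavour` function and the order in which the rules are tried are all assumptions made for this example. Only the four flavour names and their associated technologies come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Flavour(Enum):
    FULL = "full replication (iRODS to be installed)"
    LIGHT = "light replication (GridFTP to be installed)"
    CUSTOM = "custom replication (adaptation to an existing repository)"
    SIMPLE_STORE = "simple store service (projects and individuals)"


@dataclass
class Centre:
    """Hypothetical description of a community centre's readiness."""
    name: str
    has_registered_data: bool   # metadata and persistent identifiers assigned
    can_run_irods: bool
    can_run_gridftp: bool
    repository_system: Optional[str] = None  # e.g. "Fedora", "DSpace"


def suggest_flavour(centre: Centre) -> Flavour:
    """Pick a flavour; the rule order is an assumption of this sketch."""
    if centre.has_registered_data and centre.can_run_irods:
        return Flavour.FULL
    if centre.repository_system is not None:
        return Flavour.CUSTOM
    if centre.can_run_gridftp:
        return Flavour.LIGHT
    return Flavour.SIMPLE_STORE
```

A centre that already runs a repository such as DSpace but has not yet registered all of its data would be steered towards custom replication under these assumed rules, while an individual project with no infrastructure at all would fall through to the simple store service.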

Community emancipation and the CDI

Much of the work in EUDAT so far has been based on a model where the EUDAT data services are hosted by “big data centres”, while researchers from other organisations contribute to the development in a collaborative and distributed way. However, some future EUDAT services might depart from this model and give a more prominent role to researchers in both the design and operation of the services. For example, the model for one of the services currently under discussion (which we are calling “semantic annotation” for now) will be different. This service will allow researchers to check the correctness of data based on specified knowledge sources and, if necessary, to annotate the data (for example, with corrections or references) before uploading it into a data repository. The “semantic annotation” service is being realized as a plug-in that will be used not only with the simple store functionality of EUDAT, but also with any other data upload service that is offered within the research communities. “Semantic annotation” is still a common data service, but here the task of implementing it focuses on providing code that can be used across various disciplines independently of other EUDAT services. This step can be seen as emancipating the communities when it comes to providing services – although the challenge of maintaining code that is distributed in this way still needs to be solved.
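
The plug-in idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the EUDAT design: the record layout, the `annotate` and `upload_with_annotation` functions, and the use of a plain dictionary as the “knowledge source” are all hypothetical. The point it shows is that the check-and-annotate step is independent of whichever upload service is passed in.

```python
from typing import Callable, Dict, Iterable

Record = Dict[str, str]


def annotate(record: Record, field: str, knowledge: Dict[str, str]) -> Record:
    """Return a copy of the record with a reference or a note attached.

    `knowledge` stands in for a specified knowledge source: it maps
    accepted values of `field` to a reference identifier (hypothetical).
    """
    annotated = dict(record)
    value = record.get(field, "")
    if value in knowledge:
        annotated[field + "_ref"] = knowledge[value]
    else:
        annotated[field + "_note"] = "'" + value + "' not found in knowledge source"
    return annotated


def upload_with_annotation(
    records: Iterable[Record],
    field: str,
    knowledge: Dict[str, str],
    upload: Callable[[Record], None],
) -> None:
    """Check and annotate each record, then hand it to any upload callable."""
    for record in records:
        upload(annotate(record, field, knowledge))
```

Because `upload` is just a callable, the same annotation step could sit in front of a simple store client or a community's own repository upload routine, which is the sense in which the code is usable independently of other services.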

Lessons learned

A consortium can only consist of a limited number of institutions – this makes it a challenging task to transmit essential messages effectively between the project and the research communities on a broad scale.

To develop data services that will be embraced by research communities, it is necessary to involve the communities in the process. EUDAT’s work to date has shown that it is possible to engage with research communities effectively, although it is a time-consuming and non-linear process to achieve mutual understanding and come to agreements between the communities, and subsequently also to reach accord between the research communities on the one hand and IT centres on the other. It is important to be aware that engaging effectively in this way with all the stakeholders requires some seed money, particularly to fund the availability of the right experts.

In managing a project such as EUDAT, another point that must be considered is how the centres that represent each research community will communicate efficiently with the community at large. It is important to have channels that enable information from the project to be transmitted to the communities effectively, and also to have clear means for transmitting the wishes of the communities to the project consortium. Naturally it would be ideal if each community could have a much larger representation in the consortium from the very beginning, but that would result in an overly large consortium and hence cause management issues.

To promote communication between the project and the research communities, EUDAT has been involved in many events at community level: participating in discussions, giving training courses and demonstrating services. However, this has not always led to the sufficiently agile interaction with broad groups of researchers that is needed for effective changes and developments. More flexible funding models would be required to be able to react in a more agile way.

Where possible, EUDAT has also tried to learn from the experiences of other projects. For example, thanks to excellent initiatives such as Bamboo, we have noticed that there are limits to what is possible for a productive interaction process. Bamboo organized a series of large and well-funded (Mellon Foundation) conferences of humanities researchers worldwide. These certainly had an enormous effect on raising awareness worldwide. However, they turned into a platform for presenting the myriad different approaches and solutions used in digital humanities research, rather than serving as a forum for identifying common conclusions and directions, and working towards cooperative shareable solutions.

To engage more intensively with the research communities, more flexible funding schemes would be necessary.

Like other initiatives, EUDAT has now had substantial experience in actively involving individual users and whole communities, not only in terms of gathering preliminary ideas, but also – and this is decisive – in continuing an intensive agile interaction process all the way from making initial decisions on the types of services to be provided up to the actual delivery of those services. For EUDAT, this process took place in a cross-disciplinary setting. The whole process certainly required patience and respect from all the partners. A key ingredient in the success of the process was having clear goals established from the start, such as defining concrete services that would be valuable to the majority of the research communities. This helped to avoid degenerating into circular, unproductive academic debates, even at the risk of some simplification.

In order to be successful, development processes such as these need a rich flow of communication between the different actors (which was something we underestimated in the beginning). Appropriate communication channels are also vital, and it is important to be aware that what is appropriate can vary depending on the type of people involved. There needs to be a core group of committed people chosen so that all the actors are adequately represented and there must be strong and respected coordination and management to ensure progress and success.

Experience from EUDAT and other initiatives has shown that actively involving the user communities in all the steps is a must for infrastructure projects, and increases the likelihood of a broader uptake of the developed services as they are made available. For the research communities, there is indeed much to be gained early on from such collaboration, for example, cross-fertilization and harmonization with respect to concepts and knowledge, organization and technologies. Enormous insights into available technologies can also be gained, if so desired, and later on the resulting solutions for imminent data challenges are also of great value. For communities to participate effectively in such projects, a level of commitment for a certain time period is required, along with an awareness that a collaborative data infrastructure is an appropriate framework for addressing today’s data challenges. The community interactions and discussions fostered by EUDAT have resulted in an additional benefit: strong support has been established in Europe for the RDA project from the beginning. This is based on the insight that, as research communities are generally organized globally, we must have global agreements about the components that are required when building common (worldwide) data services.