The need to describe data with metadata is well understood: the problem is how best to do it. There are many answers to that question, which in itself creates a further problem: with so many standards to choose from, which one should I use to describe my data? With so many in use, which one(s) should I build my application to look for?
The Data Catalog Vocabulary, DCAT, became a W3C Recommendation in January 2014. Making use of Dublin Core wherever possible, DCAT captures many essential features of a description of a dataset: the abstract concepts of the catalog and datasets, the realizable distributions of the datasets, keywords, landing pages, links to licenses, publishers, etc. But it is clear that DCAT is not a full solution. For example, it doesn't cover versioning or time and space slices; it does not semantically relate the dataset to organisations, persons, software, projects, funding…; it describes datasets, not APIs, and so on. Other well-established and widely used schemas for describing data include CKAN's native schema, schema.org, DDI, SDMX, CERIF, VoID, INSPIRE and the Healthcare and Life Sciences Interest Group's Dataset Description vocabulary. These provide for discovery of datasets and - in some cases - contextualization (to ascertain relevance and quality) and action (access). Of the above, only CERIF provides for provenance, although the W3C Recommendation PROV is also clearly relevant here. To emphasize the variety, the UK's Digital Curation Centre - jointly with the RDA's Metadata Standards Catalog group - manages an extensive catalog of metadata standards used in different scientific disciplines.
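As an illustration of the features DCAT does capture, a minimal dataset description can be written as JSON-LD. The sketch below uses real DCAT and Dublin Core terms, but the dataset URI, titles and other values are invented for the example:

```python
import json

# A minimal DCAT dataset description in JSON-LD.
# The identifiers and values are hypothetical; only the vocabulary
# terms (dcat:*, dct:*) come from DCAT and Dublin Core.
dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "http://example.org/dataset/42",
    "@type": "dcat:Dataset",
    "dct:title": "Example observation dataset",
    "dct:publisher": {"@id": "http://example.org/org/example-institute"},
    "dct:license": {"@id": "http://creativecommons.org/licenses/by/4.0/"},
    "dcat:keyword": ["observations", "example"],
    "dcat:landingPage": {"@id": "http://example.org/dataset/42/home"},
    # A distribution is a concrete, realizable form of the abstract dataset.
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": "http://example.org/dataset/42.csv"},
            "dcat:mediaType": "text/csv",
        }
    ],
}

print(json.dumps(dataset, indent=2))
```

Note what is absent from the record: nothing links the dataset to a project, a funder or an earlier version - precisely the gaps identified above.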
This variety presents a barrier to interoperability for many applications, including the Virtual Research Environment (VRE) under development within the VRE4EIC project. The VRE draws datasets from multiple research infrastructures, many of which have data catalogs, and tries to cope with this diversity of methods of data description and data access. It does this partly by drawing on the CERIF metadata schema, which provides a mapping of many other schemas. But what is the scalable solution? Where is the line between being flexible enough to meet the needs and preferences of different communities and predictable enough to allow meaningful communication between data publishers and data users?
An application may be able to handle specific metadata schemes or, more precisely, specific profiles of metadata schemes with predefined lists of allowed values, mandatory and optional properties, etc. The European Commission, for example, has published a set of application profiles of DCAT that it recommends for communication with European data portals. This suggests a need for metadata publishers and consuming applications to be able to specify which metadata schemes are supported in a machine-readable way and to validate data against such a scheme. This is orthogonal to whether the data is provided in JSON, RDF or XML.
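Such a profile check can be sketched in a few lines. The profile below - a list of mandatory properties plus one controlled value list - is a deliberately simplified, hypothetical stand-in for what real application profiles such as DCAT-AP specify with far more machinery (cardinalities, controlled vocabularies, property ranges):

```python
# A hypothetical, much-simplified metadata profile: mandatory
# properties plus allowed values for one property. Real profiles
# (e.g. DCAT-AP) are considerably richer.
PROFILE = {
    "mandatory": ["dct:title", "dct:description", "dcat:distribution"],
    "allowed_values": {
        "dct:language": {"en", "fr", "de"},
    },
}


def validate(record, profile):
    """Return a list of human-readable violations of the profile."""
    errors = []
    for prop in profile["mandatory"]:
        if prop not in record:
            errors.append(f"missing mandatory property {prop}")
    for prop, allowed in profile["allowed_values"].items():
        if prop in record and record[prop] not in allowed:
            errors.append(f"value {record[prop]!r} not allowed for {prop}")
    return errors


# A record missing a description and using a language outside the profile.
record = {"dct:title": "Example dataset", "dct:language": "nl"}
print(validate(record, PROFILE))
```

The point of the sketch is the orthogonality noted above: the same checks apply whether the record arrives as JSON, RDF or XML, once parsed into a property map.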
A further problem in this space is vocabulary management. All the metadata vocabularies and profiles cited above are subject to different change management regimes. What is the right balance between being responsive to the community but stable enough to ensure trust in the vocabulary?
This workshop aims to clarify the steps needed to improve communication between data repositories and applications that use that data, such as virtual research environments. Applications may simply discover data, or visualize it, manipulate it, discuss it, correct it, describe it, republish it, etc. The outcome may be a new W3C Working Group chartered to extend DCAT and determine how human- and machine-readable metadata profiles are defined and made discoverable. A further aim is to explore how W3C can best support vocabulary development for a variety of communities.
Topics for the workshop include, but are not limited to:
- Approaches to dataset descriptions
- Experience of using DCAT and/or other dataset description vocabularies
- Defining and using metadata profiles
- Discovering metadata profiles
- Providing and using metadata in multiple profiles for multiple contexts
- Experiences of developing, managing and mapping vocabularies
To ensure productive discussions, the Workshop will include sessions which are primarily technical, but grounded in business needs. The sessions will be conducted in English; we will do our best to accommodate special needs, but signing and continuous translation will not be available. We invite representatives from the following communities to submit papers, although this is not intended to be an exhaustive list:
- Data catalog operators and users, especially from Research Infrastructures
- Data managers/curators
- Virtual research environment developers and users
- Developers of data-centric applications