Objective 1: Facilitate and improve data discovery and reuse

Data spaces comprise large amounts of data of various types, schemas, models, languages, and formats, from diverse sources, and with varying quality. Hence, finding the right data is challenging. Data discovery typically relies on metadata, which is often missing or incomplete. High-quality metadata requires a significant amount of manual curation, which is costly and does not scale to constantly increasing data volumes. Moreover, the typical metadata attributes provided by data catalogs do not give sufficiently detailed information about the actual content and quality of the data. Together, these limitations restrict the available search criteria and prevent users from assessing the suitability of a dataset for a given task.

STELAR will empower users to discover the right data for their needs efficiently and in a timely manner by automatically enhancing data descriptions to include: (a) additional and more fine-grained metadata automatically extracted from the content; (b) various quality indicators, including domain-specific ones; and (c) data summaries, which can increase the energy efficiency of analytics over large data volumes.
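As an illustration of point (a), the following minimal sketch derives simple per-column metadata and a completeness indicator directly from the content of a small CSV sample. It is not STELAR's implementation; the function names `profile_csv` and `infer_type`, the chosen indicators, and the sample data are assumptions made for this example only.

```python
import csv
import io


def infer_type(values):
    """Guess a coarse type for a list of non-empty string values (hypothetical helper)."""
    if not values:
        return "unknown"
    for caster, name in ((int, "integer"), (float, "number")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text):
    """Return per-column metadata computed only from the CSV content (illustrative sketch)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    for column in (rows[0].keys() if rows else []):
        values = [row[column] for row in rows]
        present = [v for v in values if v not in ("", None)]
        profile[column] = {
            "inferred_type": infer_type(present),
            # Share of rows with a value: a simple quality indicator.
            "completeness": len(present) / len(values),
            "distinct_values": len(set(present)),
        }
    return profile


# Made-up sample data, used only to exercise the sketch.
SAMPLE = "crop,yield_t_ha,region\nwheat,3.2,north\nmaize,,south\nwheat,4.1,north\n"

if __name__ == "__main__":
    for column, metadata in profile_csv(SAMPLE).items():
        print(column, metadata)
```

Attributes of this kind could then serve as additional search criteria in a catalog, for example to filter out datasets whose key columns are mostly empty.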

Objective 2: Support data linking and interoperability

Interoperability relies on vocabularies and ontologies that semantically describe data and APIs. In data spaces, several data models, ontologies, and vocabularies co-exist. Hence, mechanisms for automatically mapping and translating between different models and representations are required. Moreover, information about the same entity is often fragmented across multiple, diverse sources, and needs to be linked and aligned. For instance, entities extracted from unstructured sources may appear under different name variations, with spelling errors, or in different languages, while geospatial and time series data coming from different sources often have different resolutions.

STELAR will employ linked data technologies to semantically enrich data descriptions and interlink entities across sources, addressing schema- and instance-level matching, spatio-temporal data alignment, and correlations. Our goal is to: (a) reduce the manual effort required, by automating the configuration of workflows; (b) achieve energy efficiency and scalability when dealing with large entity collections and data volumes; and (c) ensure robustness of the proposed techniques in terms of efficiency and effectiveness.
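To make the instance-level matching problem concrete, the sketch below links entity names from two sources despite accents, punctuation, and spelling errors. It uses plain string similarity from the Python standard library, not the linked data techniques STELAR will develop; the function names, the 0.85 threshold, and the example names are all assumptions for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip accents, and drop punctuation so that variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c for c in without_accents if c.isalnum() or c.isspace())
    return " ".join(kept.lower().split())


def link_entities(source_a, source_b, threshold=0.85):
    """Return (name_a, name_b, score) for pairs whose similarity reaches the threshold."""
    links = []
    for a in source_a:
        for b in source_b:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= threshold:
                links.append((a, b, score))
    return links


if __name__ == "__main__":
    # Made-up entity names, used only to exercise the sketch.
    catalog = ["Müller Farms Ltd.", "Agro Coop Thessaly"]
    extracted = ["Muller Farms Ltd", "Agro Co-op Thesaly", "Delta Dairy"]
    for a, b, score in link_entities(catalog, extracted):
        print(f"{a!r} <-> {b!r} ({score:.2f})")
```

The nested loop compares every pair of names, which is quadratic in the number of entities. This is exactly the cost that becomes prohibitive for large entity collections and that goal (b) targets.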

Objective 3: Improve automation and reliability of data annotation and synthetic data generation

Machine Learning, especially Deep Learning, requires large amounts of labeled instances to robustly train models. AI systems require semantically annotated data from their environment to make reliable decisions and take appropriate actions. However, data annotation and labeling require domain knowledge and expertise, implying a high cost in terms of the time and effort of professional domain experts. The resulting lack of training datasets prevents many applications from benefiting from the use of AI. Moreover, data bias, if not detected and mitigated, can have significant adverse effects on the performance of an AI application.

STELAR will increase the automation and reliability of data annotation and labeling, contributing to the availability of AI-ready data. We will improve the quality and quantity of data for downstream AI applications, while minimizing the involvement of domain experts, by: (a) leveraging semi- and self-supervised learning, as well as active learning, to automatically annotate and label data; (b) offering mechanisms for explainability and for bias detection and mitigation; and (c) providing black-box and open-box synthetic data generators to overcome label and data scarcity.
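One common way active learning reduces expert involvement is uncertainty sampling: only the instances the current model is least sure about are sent to a human annotator. The sketch below shows this selection step for a binary classifier. It is a generic illustration, not the method STELAR will adopt; the function name `select_for_labeling` and the probability values are assumptions.

```python
def select_for_labeling(probabilities, budget):
    """Return indices of the `budget` predictions closest to 0.5 (least confident first)."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]


if __name__ == "__main__":
    # Hypothetical model outputs: predicted probability of the positive class per instance.
    predicted = [0.98, 0.52, 0.10, 0.45, 0.70]
    print(select_for_labeling(predicted, budget=2))
```

The remaining, confidently predicted instances could be labeled automatically, so that expert time is spent only where it is expected to improve the model most.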

Objective 4: Validate, evaluate, and demonstrate the developed tools in real-world use cases

The food supply chain covers all stages from production to transport, distribution, marketing, and consumption, and is undergoing increasing digitalization and transformation due to Big Data, AI, and 5G technologies. It involves various stakeholders, including producers, advisors, machinery manufacturers, processing actors, inspectors, certification authorities, insurance companies, and governmental agencies, all of which have an interest in, or even a legal obligation to, exchange and share data. These users need to discover, share, and reuse diverse types of data, both open and proprietary, as well as structured, semi-structured, and unstructured, coming from many different sources, with varying types and levels of metadata descriptions and quality.

We will evaluate and showcase how the STELAR KLMS deals with data management gaps and challenges in the agrifood data space through three use cases: (a) risk prevention in food supply lines; (b) early crop growth predictions; and (c) timely precision farming interventions. These cover different stages of the food chain, involve and combine different types of data, and address different user needs.

Objective 5: Maximize impact

To maximize impact, it is imperative to identify and reach the right target groups, through the right communication channels and with the right messages, and to liaise with related projects and initiatives. Moreover, appropriate and effective exploitation plans and actions are required to ensure the future adoption, sustainability, and scaling of the project’s outcomes. We will develop a clear strategy for sustainability, growth, and innovation to ensure the long-term success of STELAR outcomes, with an emphasis on open science practices and open source tools.