How Do Data Catalogs Drive Progress in Agri-data Management?
In today’s complex data landscape, data catalogs serve as indispensable navigational tools. They empower organizations to efficiently organize, discover, and leverage their data assets, thereby fostering informed decision-making, enhancing collaboration, and ensuring regulatory compliance within modern data management systems.
In the following sections, we explore the importance of data catalogs and their impact on agriculture.
The Practical Value of Data Catalogs
Finding the right data for a processing task is an open challenge for data scientists and analysts. What they need is up-to-date, rich, contextualized metadata, i.e., data providing information about the actual datasets. Metadata management is essential for extracting value from data to support business models, improve the performance of analysis and operations, and boost innovation in the data economy. To this end, data catalogs act as catalysts by:
- offering users insights about data and incentives to work with them;
- enabling data discovery and comparison in terms of quality and fit for purpose;
- reducing data preparation cost for ML processing pipelines and analytics; and
- promoting compliant data handling and usage.
Apart from data warehouses, repositories, and marketplaces, data catalogs are essential in data lakes. Data lakes are large-scale repositories for storing and managing diverse data types, facilitating scalable analysis and insights generation.
According to Gartner, a data catalog “maintains an inventory of data assets through the discovery, description, and organization of datasets”. Catalogs are metadata management tools that support the curation of data by providing inventory and discovery capabilities in an integrated ecosystem. The catalog provides context that enables data analysts, data scientists, data curators, and other end users to find and understand the datasets relevant to their analysis or process.
Diversity of Metadata Types
In general, metadata in data catalogs can be of various types:
- Technical metadata: includes information about the structure, organization, and format of the stored data, such as the location (e.g., path, URL), data source, row or column count, data type, and attribute schema.
- Governance metadata: specifies how data is created, stored, accessed, and used. This includes copyright/license information, data classification, ownership information, who can access the data (user roles) and for what purpose (user restrictions), etc.
- Operational metadata: keeps track of the flow of data throughout its lifecycle to support traceability and lineage. Metadata includes workflow specifications, status updates, dependencies, code, ETL logs, quality issues, runtime, etc.
- Collaboration metadata: contains insights based on conversations around data, like data-related comments, discussions, tags, issue tickets, etc.
- Quality metadata: includes quality metrics and measures about the dataset, such as its status and freshness, as well as about the workflows/tasks/tests in which this dataset is involved, like their statuses and metrics (e.g., precision, recall).
- Usage metadata: records information about how much a dataset is used and can reveal access patterns. Typical such metadata includes view count, popularity, rating, top users, frequency of use, and more.
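As a concrete illustration, the metadata types above could be grouped in a single catalog entry along the lines of the following sketch. All field names and values here are hypothetical, not an actual catalog schema:

```python
# Illustrative catalog entry grouping the six metadata types discussed
# above. Field names and values are hypothetical, not a real schema.
catalog_entry = {
    "technical": {
        "location": "s3://lake/crops/yield_2023.parquet",
        "format": "parquet",
        "row_count": 120_000,
        "schema": {"field_id": "string", "crop_type": "string", "yield_t_ha": "float"},
    },
    "governance": {
        "license": "CC-BY-4.0",
        "owner": "agri-coop-A",
        "allowed_roles": ["farm_advisor", "food_inspector"],
    },
    "operational": {
        "produced_by": "etl_yield_ingest_v2",   # workflow that created the data
        "last_run_status": "success",
    },
    "collaboration": {"tags": ["yield", "2023"], "open_issues": 0},
    "quality": {"freshness_days": 3, "completeness": 0.97},
    "usage": {"view_count": 412, "rating": 4.5},
}

def summarize(entry):
    """Return the metadata categories present in a catalog entry."""
    return sorted(entry.keys())
```

Keeping the categories separate like this makes it easy for a catalog to index technical and quality fields for search while treating governance fields as access-control inputs.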
Understanding the Distinction: Open vs Proprietary Data
A catalog may list open and/or proprietary data assets. Information about open data sources may come from other public catalogs or APIs, such as SpatioTemporal Asset Catalogs (STAC) for Earth Observation data collections, geodatasets covering Points of Interest or road networks extracted from OpenStreetMap, open vocabularies and ontologies, etc. In contrast, proprietary datasets are visible only to specific users or to users having certain roles (e.g., farm advisors or food inspectors in the agri-food domain).
Access control may be applied even at the data asset level, allowing a part of its description (e.g., containing basic metadata) to be publicly discoverable, whereas the rest (e.g., more detailed metadata or quality indicators) may be visible only to registered users according to their access rights. Data publishers can configure such information (e.g., license, access rights) during publishing, thus data owners retain full control over who has access to their data.
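A minimal sketch of such field-level access control, assuming a simple role-based model (the field names and roles are illustrative; the actual STELAR mechanism may differ):

```python
# Sketch of asset-level access control: basic metadata is publicly
# discoverable, while detailed metadata is returned only to users whose
# roles have been granted access. Field names and roles are illustrative.
PUBLIC_FIELDS = {"title", "description", "license"}
RESTRICTED_FIELDS = {"quality_indicators", "profiling_report"}

def visible_metadata(asset, user_roles):
    """Return the subset of an asset's metadata visible to a user."""
    visible = {k: v for k, v in asset["metadata"].items() if k in PUBLIC_FIELDS}
    # Grant the detailed fields only if the user holds an allowed role.
    if set(user_roles) & set(asset.get("allowed_roles", [])):
        visible.update(
            {k: v for k, v in asset["metadata"].items() if k in RESTRICTED_FIELDS}
        )
    return visible

asset = {
    "allowed_roles": ["farm_advisor"],
    "metadata": {
        "title": "Crop yields 2023",
        "license": "CC-BY-4.0",
        "quality_indicators": {"completeness": 0.97},
    },
}
```

An anonymous visitor would see only the title and license, whereas a user with the `farm_advisor` role would also see the quality indicators.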
Recognizing the Unique Needs of Each User
Data analysts, data scientists, and data curators can interact with a catalog to search for datasets. Typical questions they may ask include:
- Where can I find data for my processing task (e.g., crop classification)?
- What filters (e.g., keywords, time span, spatial extent) could I apply in searching for relevant data?
- Are there any recommended datasets in the lake for the task at hand?
- Who should I ask to get access to the data?
- What is this dataset about? Is there any schema, documentation, or profiling information to get more insights about this data?
- How was this data created? How frequently is it updated?
- How should I use the data? Which columns are relevant? What tables should I combine?
Advancing Data Management with STELAR's Data Catalog
The Data Catalog in STELAR is used for publishing, listing, and searching all available data in the KLMS. We have set up this Catalog based on the open-source software CKAN. The STELAR Catalog supports both human-readable and machine-readable metadata, including metadata computed by the STELAR Data Profiler, such as the dataset’s temporal or spatial extent, quality indicators, etc.
The Catalog provides an API for both publishing and searching datasets. Each published dataset is assigned a unique, persistent identifier. Thus, when a dataset is used in a workflow, it is referenced by that identifier instead of its location in the data lake, which may change over time. The Catalog stores only metadata about the datasets in the lake, whereas the actual data is stored in MinIO, an open-source storage system compatible with Amazon S3 and Kubernetes. Thus, metadata management is separated from data management.
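As a rough sketch, publishing to and searching a CKAN-based catalog typically goes through CKAN’s standard action API (`package_create`, `package_search`). The helpers below only construct the requests; the base URL and any deployment-specific metadata fields are assumptions, not the actual STELAR endpoints:

```python
import json

# Build (url, headers, body) tuples for CKAN's standard action API.
# Endpoint names follow stock CKAN; the base URL and any extra metadata
# fields a given deployment accepts are assumptions.
def publish_request(base_url, api_token, metadata):
    """Request for registering a dataset entry via package_create."""
    return (
        f"{base_url}/api/3/action/package_create",
        {"Authorization": api_token, "Content-Type": "application/json"},
        json.dumps(metadata),
    )

def search_request(base_url, query, filters=None):
    """Request for keyword plus faceted search via package_search."""
    params = {"q": query}
    if filters:
        # fq narrows results by facet, e.g. "tags:crop-classification"
        params["fq"] = " AND ".join(f"{k}:{v}" for k, v in filters.items())
    return (f"{base_url}/api/3/action/package_search", params)

url, headers, body = publish_request(
    "https://catalog.example.org", "MY-API-TOKEN",
    {"name": "crop-yields-2023", "title": "Crop yields 2023"},
)
```

A client would then send these requests with any HTTP library; the identifier returned by `package_create` is what workflows reference instead of the dataset’s physical location.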
Simplifying Agriculture Data Management Through User-Friendly Navigation
Bearing in mind that one of the main challenges in adopting a data management system in agriculture is the lack of infrastructure and technical skills, the STELAR project aims to make agrifood data easier to use. How does STELAR ensure an optimal user experience?
To enhance metadata management and data discovery, we support an extended schema of the metadata attributes regarding datasets, workflow executions and task executions, which also enables linking between datasets and task executions. Our search API supports keyword search, faceted browsing, and result comparison for easier data discovery. Most importantly, all metadata available in the Data Catalog can be exposed as a Knowledge Graph using RDF documents serialized through mappings to the KLMS ontology.
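The export of catalog metadata as a Knowledge Graph could look roughly like the sketch below, which serializes a few metadata fields as RDF triples in N-Triples form. The `klms` namespace and property names here are placeholders, not the actual KLMS ontology:

```python
# Sketch: serialize flat catalog metadata as RDF triples (N-Triples).
# The namespace and property names are placeholders; the real mapping
# targets the KLMS ontology.
KLMS = "http://example.org/klms#"   # placeholder namespace

def to_ntriples(dataset_id, metadata):
    """Map a flat metadata dict to N-Triples lines for one dataset."""
    subject = f"<http://example.org/dataset/{dataset_id}>"
    lines = [f'{subject} <{KLMS}identifier> "{dataset_id}" .']
    for prop, value in sorted(metadata.items()):
        lines.append(f'{subject} <{KLMS}{prop}> "{value}" .')
    return "\n".join(lines)
```

Representing the metadata as triples is what allows datasets, workflow executions, and task executions to be linked into one graph and queried together.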
Furthermore, we have been developing a customized, comprehensive web-based GUI that extends the Data Catalog’s capabilities for search, navigation, comparison, and ranking of the available datasets, helping KLMS users select suitable datasets. This GUI emphasizes relevant filtering criteria, such as geospatial regions, time intervals, and automatically recognized named entities (e.g., food products, hazards, crop types), allowing users to fully leverage the automated metadata extracted by the STELAR Data Profiler. It also supports dataset comparison and ranking according to multiple criteria and user preferences, rather than simply presenting lists of search results.
Conclusion
In today’s agriculture, precision farming is revolutionizing how farmers work. Using satellites, drones, and robots, all powered by data, farmers can optimize their practices like never before. However, managing this wealth of data effectively is essential for maximizing the benefits of these technologies.
Don’t miss out on the latest developments in our Blog. Connect with us on LinkedIn to stay in the loop!
NOTE: This article was written by Kostas Patroumpas from Athena Research Center. It was edited for clarity and SEO guidance by the STELAR project’s WP6 leaders.
Kostas Patroumpas
Kostas Patroumpas is a Research Associate at the Information Management Systems Institute at Athena Research Center (Greece). His research interests relate to geospatial data stream processing, trajectory data management, linked data, and geographical visualisation of complex phenomena. He has published more than 30 articles in international journals and conferences.