What are Knowledge Graphs and How Do They Enrich Data Lakes?
Knowledge graphs are more than a buzzword. They are what we use when we type “restaurants near me” in Google Search. They are also partly the reason why Google Search gets us relevant results, summarised content, and unexpected-yet-related bonus information. It comes as no surprise, then, that they also have applications in agriculture and food safety.
By helping extract, interconnect, and provide knowledge and data-driven insights as answers, knowledge graphs help structure processes and boost decision-making. In this article, we will explore what knowledge graphs are and how they improve metadata management in data lakes.
What is a Knowledge Graph? What Types of Knowledge Graphs Exist?
A knowledge graph (KG) is a network that semantically describes entities in a domain of interest and their relationships. There are two main types of knowledge graphs:
- Open knowledge graphs are published online (Wikidata is a good example).
- Enterprise knowledge graphs are internal to a company (e.g., describing products and customers).
Depending on their contents and focus, knowledge graphs can also be:
- Encyclopedic knowledge graphs provide general knowledge about the world.
- Commonsense knowledge graphs describe daily-life concepts.
- Domain-specific knowledge graphs represent specialised knowledge in a certain domain.
Typically, knowledge graphs contain textual information. However, multi-modal knowledge graphs also exist. They contain facts in additional modalities, such as images or sounds.
Commonly Used Data Models for Representing Knowledge Graphs
From a more technical perspective, the two most commonly used data models to represent knowledge graphs are the RDF model and the Property Graphs model:
- The RDF model is a recommendation by the World Wide Web Consortium. It represents facts as triples in the form <subject, predicate, object>, using globally unique identifiers (URIs) to refer to entities and relationships.
- The Property Graphs model offers more flexibility by allowing each entity or relationship to be associated with a label and a set of key-value pairs.
The SPARQL query language can be used to query RDF graphs, whereas Cypher, Gremlin, and G-CORE are languages designed for querying Property Graphs.
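To make the difference between the two models more concrete, here is a minimal sketch in plain Python. It does not use an RDF library or a graph database. The `http://example.org/` namespace, the entity names, and the `match` helper are all made up for illustration, and the wildcard matching only loosely imitates what a SPARQL triple pattern does.

```python
# Hypothetical namespace; real RDF graphs use URIs chosen by their publishers.
EX = "http://example.org/"

# RDF-style view: every fact is a (subject, predicate, object) triple.
triples = [
    (EX + "alice", EX + "bought", EX + "laptop"),
    (EX + "bob", EX + "bought", EX + "laptop"),
    (EX + "alice", EX + "livesIn", EX + "Athens"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Who bought the laptop?" expressed as a single triple pattern.
buyers = [s for s, _, _ in match(triples, p=EX + "bought", o=EX + "laptop")]

# Property-graph-style view: nodes and edges each carry a label
# and a set of key-value pairs.
nodes = {
    "alice": {"label": "Customer", "props": {"city": "Athens"}},
    "laptop": {"label": "Product", "props": {"category": "electronics"}},
}
edges = [
    {"src": "alice", "dst": "laptop", "label": "BOUGHT", "props": {"quantity": 1}},
]
```

Note how the property graph attaches `quantity` directly to the edge, whereas the triple-based view would need extra triples to say something about the purchase itself.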
What are Some Notable Knowledge Graph Benefits?
There are several knowledge graph benefits, such as advanced search and recommendations. Compared to relational databases, knowledge graphs do not require a fixed, pre-defined schema. That offers higher flexibility in many applications, especially when integrating data from multiple heterogeneous sources or representing incomplete and evolving knowledge. Moreover, with the use of standard knowledge representation formalisms, such as ontologies, they can facilitate semantic interoperability and enable reasoning.
There is also great potential for combining knowledge graphs and Large Language Models (LLMs), such as ChatGPT. A weakness of LLMs is capturing and accessing factual knowledge, which is what knowledge graphs provide. At the same time, LLMs can facilitate the construction of knowledge graphs via information extraction from text.
How to Create a Knowledge Graph?
There are a few ways to create a knowledge graph. While we will explore these methods in more detail in some future blog posts, for now we will leave you with a concise list:
- Manual construction, for example by domain experts or via crowdsourcing. This, however, has a high cost in terms of time and effort.
- (Semi-)automatic construction from different types of sources. In the case of text sources, the process involves steps such as named entity recognition, entity typing, and entity linking. In the case of structured sources, such as tables, mappings can be specified using appropriate mapping languages.
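For structured sources, the idea of a mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table, the `MAPPING` dictionary, the `rows_to_triples` function, and the `http://example.org/` namespace are hypothetical, and the dictionary is a stand-in for a real mapping language rather than an implementation of one.

```python
import csv
import io

EX = "http://example.org/"  # hypothetical namespace

# A tiny in-memory table standing in for a structured source.
TABLE = """id,name,category
p1,Olive oil,food
p2,Tractor,machinery
"""

# Hypothetical mapping: which column identifies the subject and
# which predicate each remaining column is translated to.
MAPPING = {
    "subject_column": "id",
    "predicates": {"name": EX + "name", "category": EX + "category"},
}


def rows_to_triples(text, mapping):
    """Turn each table row into one triple per mapped column."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = EX + row[mapping["subject_column"]]
        for column, predicate in mapping["predicates"].items():
            triples.append((subject, predicate, row[column]))
    return triples


product_triples = rows_to_triples(TABLE, MAPPING)
```

Keeping the mapping as data, separate from the conversion code, means the same function can be reused for other tables by swapping in a different mapping.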
Building a Knowledge Lake Management System for Enriching Data Lakes
In the scope of STELAR, a 3-year Horizon Europe project, a Knowledge Lake Management System (KLMS) is being developed. The main idea is to leverage knowledge graphs to improve and facilitate metadata management in data lakes towards making their data assets more FAIR (i.e., Findable, Accessible, Interoperable, and Reusable) and AI-ready (i.e., of higher quality).
Data lakes are centralised repositories for storing and processing large amounts of structured, semi-structured, and unstructured data, both raw and transformed, without the need to determine the purpose and the schema of the data in advance. That offers high flexibility, enabling data scientists to perform ad hoc, self-service analytics. The downside is that a data lake can degrade to a data swamp if not properly managed, making it harder to discover the right information for a given task.
To overcome this, STELAR is developing a platform that automatically collects metadata about datasets, workflows, and individual tasks when a data science pipeline is executed. The collected metadata are stored in a metadata repository. Then, a knowledge graph is constructed using a purpose-built ontology and appropriate mappings. The knowledge graph can be virtual or materialised, and essentially interlinks information about datasets and the workflows in which they are used or by which they have been produced.
Using search and question answering over the knowledge graphs, the data scientist can compare the performance of different models, as well as entire pipelines. Moreover, by associating performance metrics and dataset characteristics, the data scientist can extract insights that can further guide the design, configuration, and maintenance of pipelines.
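The following Python sketch illustrates the general idea of interlinking datasets, tasks, and performance metrics, and then querying the result. It is not the STELAR KLMS and does not reflect its ontology or API. The records, the predicate names (`workflow`, `used`, `generated`, `accuracy`), the accuracy values, and all function names are assumptions invented for this example.

```python
# Hypothetical metadata records, as might be collected while pipelines run.
# The accuracy values are invented for illustration only.
task_runs = [
    {"task": "clean", "workflow": "wf1", "used": "ds_raw", "generated": "ds_clean"},
    {"task": "train_a", "workflow": "wf1", "used": "ds_clean",
     "generated": "model_a", "accuracy": 0.81},
    {"task": "train_b", "workflow": "wf2", "used": "ds_clean",
     "generated": "model_b", "accuracy": 0.87},
]


def build_graph(records):
    """Turn every field of a record into a (task, predicate, value) triple."""
    triples = []
    for record in records:
        task = record["task"]
        for key, value in record.items():
            if key != "task":
                triples.append((task, key, value))
    return triples


def subjects(triples, predicate, obj):
    """All subjects linked to obj through predicate."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def objects(triples, subject, predicate):
    """All values linked from subject through predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def rank_tasks_using(triples, dataset, metric="accuracy"):
    """Rank tasks that used a dataset by a recorded metric, best first."""
    scored = []
    for task in subjects(triples, "used", dataset):
        for value in objects(triples, task, metric):
            scored.append((task, value))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


graph = build_graph(task_runs)
producers = subjects(graph, "generated", "ds_clean")  # which task produced it?
ranking = rank_tasks_using(graph, "ds_clean")         # which consumer did best?
```

Because datasets and tasks share one graph, a question about provenance ("what produced this dataset?") and a question about performance ("which task using it scored best?") are answered with the same two lookup helpers.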
Road to AI-ready Data Sets and Semantic Interoperability
NOTE: This article was written by Dimitris Skoutas from Athena Research Center. It was edited for clarity and in line with SEO guidelines by the STELAR project’s WP4 leaders.