Petros Skoufis in Spotlight for STELAR: Data Science at the Core
In this episode of the STELAR project’s Data Stories 360° podcast, Tatjana Knezevic, Head of Exploitation at Foodscale Hub, engages in an insightful conversation with Petros Skoufis, a Data Science Researcher at the Athena Research Center – the coordinating organisation of the STELAR project.
Petros, an expert in knowledge graphs and Artificial Intelligence, shares his passion for innovation and his journey into the world of data science. Throughout the discussion, we explore Petros’ expertise in Large Language Models and Generative AI, as well as his role in the STELAR project.
Read on as we examine the complexities of designing the STELAR project’s architecture, data catalog, and knowledge graph, and discuss the critical importance of data quality management.
Exploring Our Guest's Motivation and Background
Can you share a bit about your journey into data science? What inspired you to specialise in this field, particularly focusing on areas like Large Language Models and Generative AI?
I studied Electrical and Computer Engineering at the National Technical University of Athens, where I first discovered my passion for data science and artificial intelligence. Before joining STELAR, I had some research experience but little exposure to Large Language Models (LLMs) or Generative AI, as these technologies were not as prominent at the time. However, I always enjoyed solving problems using AI and finding ways to help people through technology.
After joining STELAR, there was this breakthrough in LLMs and Generative AI, and at that point it became clear to me that this was the specific domain I wanted to specialise in.
What I find most satisfying about this field is that we are all witnessing remarkable advancements every day, and as a researcher this is fascinating because it means you have to tackle new open issues all the time. I would say it is a paradise for a researcher, whereas people in corporate settings might find the pace overwhelming because they have to adopt new technologies constantly.
I think this is the main reason I love Large Language Models and Generative AI at the moment: it is data science presented in a very interpretable and human-like way. It is much easier for someone to understand the results from LLMs and Generative AI solutions. I find it truly intriguing that I can quickly explain to business partners, or even my friends and everyday people, what I achieve with the technologies I use and my work in general.
As a Data Science Researcher at Athena Research Center, what is your specific role in the STELAR project and how does your work contribute to achieving its overall objectives?
I spend an equal amount of time on research and development, particularly on novel techniques and algorithms for data management. This is my area of expertise. My role in STELAR primarily focuses on designing, implementing, and validating methods to improve data discovery and information extraction.
For example, in our work with one of our partners, Agroknow, I specialise in helping the company extract entities of interest, such as foods and hazards, from unstructured information. In our case, this involves plain-text product recall documents from various data sources. I also contribute significantly to the knowledge organisation part of the data catalog.
In general, I assist with data discovery by developing data schemas, adding descriptions, and providing new data points for each dataset, making it easier for people to retrieve datasets or answer their questions regarding the data we provide in the catalog.
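To make the food-and-hazard extraction described above more concrete, here is a minimal, purely illustrative sketch: it prompts a general-purpose LLM for the two entity types and parses a structured answer. The client, model name, and prompt wording are editorial assumptions, not STELAR’s actual pipeline.

```python
# Illustrative sketch only: extracting food and hazard entities from a recall
# notice with a generic LLM client. Model choice and prompt are assumptions.
import json
from openai import OpenAI  # works against any OpenAI-compatible endpoint

client = OpenAI()  # expects an API key in the environment

def build_prompt(recall_text: str) -> str:
    return (
        "Extract the food products and the hazards mentioned in the recall text below.\n"
        'Reply with JSON only, in the form {"foods": [...], "hazards": [...]}.\n\n'
        "Recall text:\n" + recall_text
    )

def extract_entities(recall_text: str) -> dict:
    """Return {'foods': [...], 'hazards': [...]} for one recall document."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable instruction-tuned model works
        messages=[{"role": "user", "content": build_prompt(recall_text)}],
        temperature=0,        # keep extraction deterministic
    )
    return json.loads(response.choices[0].message.content)

print(extract_entities(
    "Brand X recalls its peanut butter cookies because of undeclared milk."
))
# e.g. {"foods": ["peanut butter cookies"], "hazards": ["undeclared milk"]}
```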

Overcoming Challenges in Implementing Generative AI
Are there any challenges in obtaining information from Agroknow to understand their daily operations and identify opportunities for optimisation?
Whenever you have to bridge academia with business, there are challenges. But I think having a consortium in which business partners and academic partners cooperate is vital, because this is the only way we can actually see what challenges a company faces when it has to adopt an academic solution.
I find it very interesting, and so far, our cooperation is working well because we have a clear challenge that we want to tackle. We have a problem that we can understand, and for them, it is a very important issue in their pipeline of processes. We are working towards solving it, and we got lucky, actually, because Large Language Models ended up being a good solution and a suitable substitute for their pipeline. We have overcome many challenges so far, which is a positive thing.
Since your team leads the design of the architecture, the data catalog, and the knowledge graph, what challenges have you encountered in developing these tools, and how have you overcome them?
I think challenges are part of our everyday activities – figuring them out and then tackling them. There are three main verticals in the development of the catalog where we constantly have to address challenges.
The first one, I would say, is data. There is huge heterogeneity in the formats of data we have to handle. We manage tabular data, textual data, geospatial data, and time series, all coming from our partners’ use cases. To handle all this data, we need frameworks that support every one of these types, and then, even within each type, we need to support multiple formats and provide solutions for all of them.
So if profiling is hard for a single type of data, imagine doing it across different formats and structures. There are no shortcuts in this challenge. However, we focus a lot on having quality conversion pipelines, so we can convert between formats without losing any of the key values encapsulated in the data. We also do a lot of work on mapping between formats and on entity linking, which helps us connect data about the same things even when it comes from different sources in different formats.
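As an editorial illustration of the entity-linking idea, not a STELAR component, the sketch below links free-text hazard mentions from different sources to a shared canonical identifier using nothing more than string similarity; a real linker would add context, embeddings, and the knowledge graph.

```python
# Minimal entity-linking sketch: map free-text mentions coming from different
# sources and formats to one canonical entry. Toy vocabulary and plain string
# similarity only; real systems use far richer signals.
from difflib import SequenceMatcher

CANONICAL = {
    "HAZ:001": "Salmonella",
    "HAZ:002": "Escherichia coli O157:H7",
    "HAZ:003": "undeclared milk allergen",
}

def link(mention: str, threshold: float = 0.5) -> str | None:
    """Return the canonical ID whose label is most similar to the mention."""
    def score(label: str) -> float:
        return SequenceMatcher(None, mention.lower(), label.lower()).ratio()
    best_id = max(CANONICAL, key=lambda cid: score(CANONICAL[cid]))
    return best_id if score(CANONICAL[best_id]) >= threshold else None

print(link("E. coli O157:H7"))   # -> HAZ:002
print(link("undeclared milk"))   # -> HAZ:003
```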
AI is the second challenge we face every day. We cannot lock ourselves into any single technology; we need to develop code and software that is model-agnostic. Identifying this need early helped us a lot, because it meant we did not have to discard work we had already done. We had a model-agnostic architecture in mind from the start and proceeded with that design.
This has also helped us focus on developing solutions and technologies that complement LLMs, work outside of these models, and help them perform better regardless of the specific model we use. This is incredibly important because it allows us to quickly switch between the best models at any moment, and I think this is something our partners really appreciate. In general, it is a great trait of our work.
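One common way to stay model-agnostic, shown below as a hypothetical sketch rather than STELAR’s actual code, is to hide every provider behind a single small interface so the rest of the pipeline never imports a specific vendor SDK.

```python
# Sketch of a model-agnostic design: pipeline code depends only on this Protocol,
# so switching the underlying LLM means swapping one adapter, not the pipeline.
from typing import Protocol

class TextModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class EchoModel:
    """Stand-in model for local testing; a real adapter would wrap a vendor SDK."""
    def complete(self, prompt: str) -> str:
        return f"[stub answer to: {prompt[:40]}...]"

def summarise_dataset(description: str, model: TextModel) -> str:
    # Identical pipeline code regardless of which model adapter is passed in.
    return model.complete("Summarise this dataset description:\n" + description)

print(summarise_dataset("Product recalls collected from EU portals, 2015-2024.", EchoModel()))
```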
The third challenge we constantly face is architecture. Again, like data, I think this has no shortcuts. We need to have pipelines that solve very diverse problems. We set a very high target from the beginning regarding the modularity and integration needs of our platform and how we designed it.
Our partners have been very helpful in this regard because having people who can give you feedback from the very beginning on possible integration issues or share how they would proceed with integrating their pipelines and processes into one platform is crucial. It helps you come up with a very effective design from the start, which you will tweak and apply their feedback to, but it is still a much better starting point.
Data quality management is a crucial part of STELAR. How does your work ensure that the data used in the project is reliable, accessible, and actionable?
When you have a platform that uses data from different vendors and providers, ensuring and evaluating the value of that data is always a challenge. However, we work on it a lot by providing transparency, explaining the variables, and explaining the datasets we use.
We have a lot of tools, such as the profiler, that try to provide descriptions to help people better understand those datasets. We also have another part of the work focused on data cleaning and data imputation, to fill in the gaps of potentially lost or incorrect data. So, regarding reliability, a lot of work is being done in this area.
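As a toy illustration of the cleaning and imputation step (an editorial simplification, not the project’s tooling), pandas alone can already flag and fill the most obvious gaps:

```python
# Toy cleaning/imputation sketch with pandas: detect missing values and fill
# them with simple rules. Real pipelines use more careful, column-aware strategies.
import pandas as pd

df = pd.DataFrame({
    "product":   ["cookies", "yogurt", None, "flour"],
    "country":   ["GR", "DE", "GR", None],
    "price_eur": [2.5, None, 1.2, 0.9],
})

print(df.isna().sum())  # how many values are missing per column

df["price_eur"] = df["price_eur"].fillna(df["price_eur"].median())  # numeric: median
df["country"] = df["country"].fillna("unknown")                     # categorical: placeholder
df = df.dropna(subset=["product"])  # rows without a product name are unusable

print(df)
```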
When it comes to data discovery, we try to leverage all the work being done in descriptions, data enrichment, and data imputation to create complete summaries and descriptions of datasets. We organise them in a way that makes them easier to retrieve and helps answer questions related to those datasets. This is crucial because, in our era, it is not that we lack data – it is the fact that we have so much data that it is almost impossible to find exactly what we are searching for.
So, in a catalog that hosts a vast amount of data, we need to make sure that we have the processes and pipelines to help people figure out what they need and get the best value from the data sources.
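A drastically simplified picture of that discovery step, assuming nothing about the real catalog’s internals, is ranking dataset descriptions against a user’s question; TF-IDF similarity is enough to show the idea, although production systems would use richer semantic search.

```python
# Simplified dataset-discovery sketch: rank catalog descriptions against a query
# with TF-IDF and cosine similarity. Real catalogs use richer semantic retrieval.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog = {
    "recalls-eu": "Food product recalls reported in the EU, with hazards and dates.",
    "fields-gr":  "Geospatial parcels of Greek farmland with crop type per season.",
    "ndvi-ts":    "Satellite NDVI time series per field, weekly, 2018-2024.",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(catalog.values()))

def search(query: str, top_k: int = 2) -> list[str]:
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    ranked = sorted(zip(catalog, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]

print(search("which datasets describe food hazards?"))  # 'recalls-eu' ranks first
```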
Finally, when it comes to making data actionable, one solution to this problem is the open-source aspect. We ensure that we provide frameworks, open-source solutions, and open-source tools, so it is not only about what we build but also about what others can develop based on our solution. The transparency we provide allows everyone to use our solution freely and more safely.
Then, all the work we do in entity linking, profiling, and enriching data helps people take part of the data we provide through the catalog, integrate it into their pipelines, and connect it with their logic, knowledge schemas, or processes. This reduces complexity and allows them to exploit the data for their purposes, which is ultimately what we want in STELAR.

AI-Driven Innovations in STELAR
How are transformative technologies like Large Language Models and Generative AI integrated into the STELAR project, and what value do they add to its objectives?
The STELAR project is quite an interesting example of how quickly technology can evolve, because when it was designed and launched, these technologies either did not exist or were not widely known. We did have some early versions of these technologies, which we had included in the proposals, but there was no discussion of LLMs or Generative AI at the time.
It was a topic being discussed by only a few, and then, just months after the project began, we all witnessed the breakthrough of these technologies. This transformative event completely changed the landscape, with more and more people suddenly interested in implementing these models and exploring their capabilities.
We were fortunate enough to recognise the potential of these technologies and quickly decided to adopt them within our project. This is a great example of how dynamic a project can be, especially when you have a consortium made up of both academic and business partners. Academic partners stay at the cutting edge of technology because it is the way to achieve academic success. However, we were lucky to have business partners who quickly understood the value of these technologies and gave us the green light to explore and implement solutions. Although the project’s overall scope remained unchanged, we were able to adjust the means and tools we used to achieve that scope.
LLMs are beneficial in various ways. Firstly, they have allowed us to tackle many tasks within the project more effectively. While I will not go into the specifics, they have proven to be better solutions in many domains. For instance, in my area, knowledge extraction, LLMs have been remarkable, and they are gradually gaining traction in new areas such as time series predictions, continually improving with time.
One of the advantages of LLMs is the way they operate. By providing instructions similar to how you would direct a person, they can perform tasks with great efficiency. This feature allowed us to convert business partner requirements into actual solutions more easily. We were able to integrate the processes that people were previously carrying out manually and translate them into artificial intelligence-driven solutions.
Furthermore, LLMs have helped us explain our results more clearly. The outputs are presented in natural language, making it easier to communicate our findings, even in the prototype phase. We were fortunate in all these aspects, but it is important to note that nothing comes without its challenges.
The speed of technological advancements was one challenge, but another was the caution required when developing solutions where the LLMs make decisions or provide results. This was an entirely new field for us, and we understood the risks as we moved forward. Fortunately, we had strong expertise in both academia and business to help manage these challenges. In essence, LLMs have added significant value and performance to the project, but at a cost that we had to manage internally.
Could you explain how data profiling tools work and their impact on enhancing data discovery within the STELAR project?
Data discovery largely depends on the availability of comprehensive data descriptions. The role of a profiler is to assist in automatically producing these descriptions, complementing and augmenting those provided by data owners. This is particularly important when manual descriptions are either missing or insufficient.
To improve the quality of these descriptions, we take two main approaches. The more traditional method involves calculating statistics on datasets. These statistics, such as value uniqueness, frequency, minimum, maximum, and average, help us understand various aspects of specific columns or areas in the dataset. They can also assist in imputing missing or incorrect values.
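Those statistics map directly onto simple dataframe operations; a minimal sketch of what a profiler computes per column (not the actual STELAR profiler) could look like this:

```python
# Minimal column-profiling sketch: the kinds of per-column statistics a profiler
# produces automatically (uniqueness, frequency, minimum, maximum, average).
import pandas as pd

df = pd.DataFrame({
    "hazard": ["salmonella", "listeria", "salmonella", "allergen"],
    "cases":  [12, 3, 7, 1],
})

def profile_column(series: pd.Series) -> dict:
    stats = {
        "dtype": str(series.dtype),
        "n_unique": series.nunique(),
        "top_value": series.mode().iloc[0],  # most frequent value
        "missing": int(series.isna().sum()),
    }
    if pd.api.types.is_numeric_dtype(series):
        stats.update(min=series.min(), max=series.max(), mean=series.mean())
    return stats

for col in df.columns:
    print(col, profile_column(df[col]))
```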
The second approach involves semantic type annotation. Instead of simply calculating statistics over values, we focus on disambiguating and understanding the meaning of tables and columns. We provide actual mappings to an ontology or free-text descriptions. This process allows us to present users with a complete view of a dataset, so they do not have to explore it manually.
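Below is a deliberately naive sketch of semantic type annotation, with small lookup tables standing in for the LLM- and ontology-based methods actually used: each column is mapped to a concept in a toy ontology based on its values.

```python
# Toy semantic-type annotation: guess which ontology concept a column refers to
# by checking its values against small controlled vocabularies. The approach in
# the interview relies on LLMs and a real ontology, not hard-coded lookups.
TOY_ONTOLOGY = {
    "onto:FoodProduct": {"yogurt", "flour", "cookies", "cheese"},
    "onto:Hazard":      {"salmonella", "listeria", "aflatoxin", "undeclared milk"},
    "onto:Country":     {"gr", "de", "fr", "it"},
}

def annotate_column(values: list[str]) -> str | None:
    """Return the concept whose vocabulary overlaps most with the column values."""
    cleaned = {v.strip().lower() for v in values}
    concept, vocab = max(TOY_ONTOLOGY.items(), key=lambda kv: len(cleaned & kv[1]))
    return concept if cleaned & vocab else None

print(annotate_column(["Salmonella", "listeria", "aflatoxin"]))  # -> onto:Hazard
print(annotate_column(["GR", "DE", "IT"]))                       # -> onto:Country
```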
These efforts also help us connect similar datasets, link them to a common ontology, and create more holistic solutions for specific problems or domains. By employing LLMs, we have achieved significant progress in presenting these descriptions and profiling results in a way that is easily understandable by experts. Our ultimate goal is to make these descriptions comprehensible to everyone, not just experts, so that all users can understand any type of dataset.

What the Future Holds for AI and Data Science
The field of AI and data science is evolving rapidly. What emerging trends or technologies do you think will have the most significant impact in the next five years?
The five-year time span is difficult to predict due to the fast pace of technological advancements. However, there are key emerging technologies and frameworks that are becoming increasingly important. One of the major trends is agentic AI, which involves breaking down tasks into sub-tasks and using LLMs to help solve these sub-tasks in a way that contributes to achieving the larger goal. This framework is gaining popularity because it simplifies the process of explaining tasks to the LLMs and does not require extensive coding knowledge. No-code resources are emerging to help individuals with minimal coding experience use these models to solve problems.
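The decompose-then-solve loop behind agentic AI fits in a few lines; the sketch below keeps the model as a plain text-in/text-out callable, so no particular agent framework or provider is implied.

```python
# Bare-bones "agentic" loop: ask a model to break a goal into sub-tasks, then
# solve each sub-task in turn. `llm` is any text-in/text-out callable; a stub
# is used here so the sketch runs without any provider.
from typing import Callable

def run_agent(goal: str, llm: Callable[[str], str]) -> list[str]:
    plan = llm(f"List the sub-tasks needed to achieve: {goal}. One per line.")
    subtasks = [line.strip("- ").strip() for line in plan.splitlines() if line.strip()]
    return [llm(f"Sub-task: {task}\nGoal: {goal}\nGive the result.") for task in subtasks]

def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    if prompt.startswith("List the sub-tasks"):
        return "- find relevant datasets\n- profile them\n- summarise findings"
    return f"[done] {prompt.splitlines()[0]}"

for step in run_agent("report on recent food recall trends", stub_llm):
    print(step)
```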
Another promising development is the use of small LLMs for edge computing, such as on mobile devices. This innovation will enable greater adoption among everyday consumers, not just businesses, as it will help address resource constraints without the high costs typically associated with these models. Small models are improving daily, and the Athena Research Center is actively experimenting with them to explore their potential.
The fact that these models are getting better every day at understanding and producing voice is crucial because, again, it changes much of the interaction paradigm for mobile and edge devices. It is also very interesting because it adds another level of abstraction: it makes interaction with humans more fluid and opens up a lot of opportunities in edge computing use cases.
Finally, I would say that support for new types of data is crucial. When we started with LLMs two years ago, the focus was mostly on having them understand documents. Then new LLMs appeared that could understand or generate things like pictures and videos. Now we can move into more niche domains, like time series.
In our project, for example, we have satellite images captured at different wavelengths. This is a very niche domain, and there is no expert LLM in this field right now. But imagine what can happen when LLMs enter this domain, which they will, because classical data types are at this point almost fully covered by existing solutions. So, in general, I think we can be very optimistic about the adoption of LLMs and the use cases they will solve in the next few years, probably fewer than five.
Conclusion
Petros shared valuable perspectives on the evolving role of AI, data, and technology in shaping the digital world. His insights highlight the importance of bridging research with practical applications to drive meaningful innovation.
More STELAR episodes are coming your way! Stay connected by following our YouTube channel and Blog. See you soon!