Vivek Kumar in the Spotlight for STELAR: Bias Detection in AI
In our third episode of the STELAR project’s Data Stories 360° podcast, we hosted Vivek Kumar, a multifaceted researcher specialising in Natural Language Processing and a member of the STELAR consortium from the University of the Bundeswehr Munich. Vivek shared his expertise in bias detection, mitigation, explainability, and synthetic data generation.
Tatjana Knezevic, Head of Exploitation at Foodscale Hub, one of the partners on the STELAR project, interviewed Vivek with the aim of understanding the sources of data bias and label scarcity issues.
Read on to find out how his research applies to, and is vital for, many facets of agriculture: as smart agriculture becomes the norm, so does the need to understand data bias.
The Professional Background of Vivek Kumar
Could you tell us about your background and what inspired you to specialise in natural language processing, particularly in healthcare and clinical applications?
It is no secret that AI is the new oil, revolutionising the entire landscape of current research and application-based industries. My motivation was to use AI in some real applications where I could leverage it for the greater good. Implementing AI in ways that assist or even take over certain tasks from clinicians and practitioners to reduce their burden and improve health management was a primary focus for me.
I started with a particular interest in clinical and psychological applications of AI because these fields are still exploring AI’s potential. They are crucial domains but often face resource constraints, making it challenging to assess AI’s impact both quantitatively and qualitatively.
Is this motivation one of the reasons you joined the artificial intelligence and machine learning group at the University of the Bundeswehr Munich?
This is one of the reasons, but in particular, I was very inspired by Professor Eirini Ntoutsi’s work. She has a wide-ranging scope of work, with many projects on developing reliable and interpretable AI, which is essential at this moment. We have to move beyond black-box experiments and get to something more interpretable and reliable.
As I mentioned before, I am more into implementing things based on real-world scenarios. I found Professor Eirini’s work very much on point and relevant, and that was also a motivation to join. Of course, it is a great place to work with such a disciplined task force and outstanding researchers.

Role in STELAR
The University of the Bundeswehr Munich is leading WP4 on AI-ready data within the STELAR project. Could you elaborate on your specific role and responsibilities in this work package?
I am coordinating STELAR for the University of the Bundeswehr Munich under the supervision of Professor Eirini Ntoutsi. Our work package focuses on AI-driven and AI-ready data. One major challenge we face is bias, often stemming from insufficient data or inherent imbalances within the data.
Regardless of the modality—be it image data, speech, audio, or text—it is imperative that sufficient data is available. If you want reliable AI system development, the data at hand should also be balanced in some capacity.
By that, I mean the data should be free from several flaws. For instance, it should not be too imbalanced, and it should not be too small, because deep generative models require a lot of training data to sufficiently capture the nuances and polarity of the domain.
Work Package 4 works in this direction to identify the biases that could lead to wrong predictions or unreliable AI outcomes throughout the entire end-to-end pipeline. We identify the potential sources that could negatively influence the machine learning process, or the process of classification or prediction.
Concretely, we are developing an end-to-end pipeline that, at all stages of the machine learning process, will be able to identify biases. Later, we take those biases into account and further mitigate them to generate a responsible AI-based system. The focus is currently on the agri-food space, so there are some biases that come particularly from this domain, which are important to address in order to develop reliable classification systems.
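As a purely illustrative sketch of what one stage of such a pipeline can check (this is not the STELAR implementation; the labels, predictions, and groups below are invented), a data-level bias audit might flag class imbalance and per-group performance gaps:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of most to least frequent class; 1.0 means perfectly balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def accuracy_gap(y_true, y_pred, groups):
    """Largest accuracy difference between any two groups (e.g. regions)."""
    acc = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        acc[g] = sum(y_true[i] == y_pred[i] for i in idx) / len(idx)
    return max(acc.values()) - min(acc.values()), acc

# Hypothetical hazard labels, model predictions, and report regions
y_true = ["allergen", "allergen", "allergen", "contamination"]
y_pred = ["allergen", "allergen", "allergen", "allergen"]
groups = ["EU", "EU", "Asia", "Asia"]

print("imbalance ratio:", imbalance_ratio(y_true))            # 3.0 -> skewed classes
print("accuracy gap:", accuracy_gap(y_true, y_pred, groups))  # gap across regions
```

Checks like these would feed the mitigation stages that follow, rather than being an end in themselves.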
How does your expertise in natural language processing and fairness contribute to the goals of Work Package 4, especially in bias detection, mitigation, and synthetic data generation?
You have to have a certain understanding of the domain to grasp the criticality and the aspects of how a bias could manifest in the machine learning pipeline or in the data itself. My expertise has been based on domain adaptation.
What we do is take into account all the available world-knowledge resources, by means of knowledge graphs or fact triples, and inject that knowledge into the machine learning pipeline to make the model more intelligent. A toy sketch of this idea follows.
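A minimal sketch of knowledge injection, assuming a tiny hand-written triple store (the entities and relations are hypothetical, and a real system would use entity linking rather than substring matching):

```python
# Hypothetical fact triples: (subject, relation, object)
TRIPLES = [
    ("peanut", "is_a", "allergen"),
    ("aflatoxin", "found_in", "peanut"),
    ("aflatoxin", "is_a", "contaminant"),
]

def enrich(text):
    """Append verbalised triples that mention an entity from the text,
    so a downstream model sees the external knowledge as extra context."""
    lowered = text.lower()
    facts = [f"{s} {r.replace('_', ' ')} {o}"
             for s, r, o in TRIPLES
             if s in lowered or o in lowered]
    return f"{text} [KNOWLEDGE] " + "; ".join(facts) if facts else text

print(enrich("Recall issued for peanut butter batch."))
# -> "... [KNOWLEDGE] peanut is a allergen; aflatoxin found in peanut"
```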
In STELAR specifically, I am developing a knowledge base or an extra knowledge resource that could enrich the domain. I have also been working on generating a lot of data.
For instance, in the clinical field, there are many areas where data is very scarce, such as psychology or mental health. I bring my expertise in generating the data, annotating it, and scraping data from various sources to enrich the domain. First, you identify the sources of bias based on your domain understanding, then you enrich the domain by generating data so that your model can train properly and adequately.
This helps avoid overfitting, underfitting, or inducing bias in the process. Synthetic data generation is integral here: it is a significant part of mitigating bias, and it also serves the community by providing resources that foster further research. A toy illustration follows.
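As a toy illustration of the principle only (real work in this area typically relies on deep generative models, and the templates and slot values below are made up), a scarce class can be enriched by filling templates:

```python
import random

# Hypothetical templates and slot values for augmenting a scarce hazard class
TEMPLATES = [
    "Traces of {hazard} detected in {product} imported from {origin}.",
    "{product} recalled after {hazard} was found during routine testing.",
]
SLOTS = {
    "hazard": ["salmonella", "undeclared milk", "pesticide residue"],
    "product": ["frozen spinach", "wheat flour", "ground paprika"],
    "origin": ["a regional supplier", "an overseas distributor"],
}

def synthesize(n, seed=0):
    """Generate n synthetic minority-class examples by filling templates."""
    rng = random.Random(seed)
    return [rng.choice(TEMPLATES).format(**{k: rng.choice(v)
                                            for k, v in SLOTS.items()})
            for _ in range(n)]

for example in synthesize(3):
    print(example)
```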

Challenges in Data Collection
We are collecting data within the scope of three pilots. We have a pilot for risk assessment, where we take data from Agroknow, and we are also working with two other pilots, with Abaco and Vista. Are there any specific data challenges that you encounter during this process?
I would say that in agriculture, when we talk about image or video modalities, data is much more readily available, because a lot of satellite imaging and remote sensing is capturing data. I would not say all of that data enriches the domain, but we do not have the kind of scarcity we see with text data.
For instance, if you want to generalise something regarding certain allergies or cases where contamination could lead to medical conditions, you need a lot of text data to understand the nature of that hazard from the food.
In particular, I have felt that the text data in agriculture is very scarce and needs to be enriched to have some AI implementation in place to help the use cases. In STELAR, as you mentioned, Agroknow is developing that kind of data and helping all the partners gain access to it and leverage it for research.
For example, one use case is to identify the sources of bias and perform hazard classification and identification in food incident reports, detecting hazard elements or allergy components.
It would be nice to see how this entire use case manifests at the end of the project and what we can develop in terms of a robust end-to-end pipeline that can predict biases in the domain. This can help generalise some aspects that come from the food industry.
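To make the use case concrete, a baseline for such a hazard classifier might look like the following scikit-learn sketch (the incident reports, labels, and query are invented for illustration; this is not the STELAR system):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented food incident reports with hazard-type labels
reports = [
    "Undeclared peanuts found in chocolate bars",
    "Salmonella detected in raw chicken batch",
    "Milk allergen not listed on biscuit packaging",
    "Listeria contamination in soft cheese",
]
labels = ["allergen", "contamination", "allergen", "contamination"]

# TF-IDF features plus logistic regression as a simple, inspectable baseline
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reports, labels)

print(clf.predict(["Egg traces found in pasta product"]))  # e.g. ['allergen']
```

A baseline like this is also a natural place to run the bias checks described earlier, since its errors are easy to break down by class and by group.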
Innovative Methods and Tools
What can you share about the innovative methods and tools your team is developing for bias detection and mitigation in AI models?
We are developing an end-to-end pipeline that takes into account the modality and the peculiarities of the domain, covering the bias manifestations that can occur at different stages. For instance, if the data is unbalanced, you have an imbalance bias; if your text sequences are too long and your model cannot process them, you suffer semantic loss, which is another kind of bias and leads to contextual bias.
There are also biases based on region, demography, or particular intolerances. For example, lactose tolerance varies considerably between populations, so a product containing dairy may be well tolerated by some groups and cause allergic reactions in others. These biases depend on the data source and can be induced by it. This framework should identify those biases so that proper mitigation strategies can be applied, with the final goal of developing responsible AI methods.
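As one concrete illustration of the sequence-length issue, the check below estimates how much content a fixed-context model would silently drop (a simplified sketch: it counts whitespace tokens, whereas a real system would count model subword tokens, and the 512-token limit is an assumption):

```python
MAX_TOKENS = 512  # assumed model context limit, not a STELAR value

def truncation_report(docs, max_tokens=MAX_TOKENS):
    """Estimate how much content a fixed-context model would silently drop.
    Whitespace tokens stand in for real subword tokenisation."""
    flagged, lost, total = 0, 0, 0
    for doc in docs:
        n = len(doc.split())
        total += n
        if n > max_tokens:
            flagged += 1
            lost += n - max_tokens
    return {"docs_truncated": flagged,
            "share_of_tokens_lost": lost / total if total else 0.0}

docs = ["token " * 800, "short incident report"]  # first doc exceeds the limit
print(truncation_report(docs))
# -> {'docs_truncated': 1, 'share_of_tokens_lost': ~0.36}
```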

Market Availability and Commercialisation
Are there similar solutions already available on the market, or is this still in the research and development phase?
I would rather say it is more in the research phase. We have seen AI bringing a revolution in precision farming and agriculture, but bias is a crucial aspect that needs to be considered at all times. Especially for text, I have not seen much work done. Of course, several parallel works are going on, but as far as the existing literature is concerned, there is not much to go on. Text data in agriculture is very low-resource, meaning there are few lexical and textual resources available, so the area is still under significant exploration.
As for commercialisation, some products I have seen are limited in scalability, development, or reuse policies, or by licensing restrictions. These constraints exist. A fairer approach is to develop something for public use: GDPR-compliant, public-domain resources that cater to disruptive research ideas from fellow researchers. We are working in this direction, and all our results and outcomes are GDPR-compliant and available in the public domain. We focus on keeping things open for more people to access, rather than on commercialisation that hinders distribution.
I think it is a constant process: if you make things more publicly accessible, you get feedback from time to time and keep yourself up to date with current research as well. Under European Union law and our project regulations, certain information must remain exclusive to the project, while other aspects require public documentation. We are following this protocol.
Benefits to End Users
Can you mention some benefits that end users of the STELAR knowledge management system can expect, particularly for labelling satellite images into different crop types?
The end users of the STELAR KMS should be able to use the tool for various purposes. For instance, it can be used in precision agriculture, providing accurate, up-to-date information about different crop types. It can also be used for resource management, such as water, fertilisers, and pesticides, and for understanding how crops are distributed. Yield prediction is also important due to weather changes and shifts in the growing seasons.
For instance, if summer came in June last year, this year it could come a bit later. So we see a temporal shift that also hampers the development of AI-based systems. Yield prediction is an important aspect, and we are considering the temporal aspects of crop development phases to mitigate this kind of situation, which could otherwise lead to potential bias.
Another important aspect is land use analysis. Not all fields marked in one year keep the same shape the next year, because mixed farming, crop rotation, and various other agricultural practices mean the land does not remain unchanged across seasons. Land use analysis is important to understand, and it is interconnected with yield prediction.
It also supports governmental policy making, providing valuable data that government entities can use for decision-making on food security, agricultural policies, and subsidy allocations. And of course it contributes to further research and development: this kind of data enriches the domain and motivates others to explore a direction that is grounded in, and applicable to, real-world scenarios and adapts to further changes.
These are the benefits and advantages of developing this system. As an end goal of the STELAR project, we should be able to facilitate these things, cater to the targeted people, and make the tool user-friendly.
Will the tool be user-friendly, or will it need some kind of training to be delivered to the end users?
The effort will be to make everything as seamless and user-friendly as possible, with a GUI. But given that we are considering a lot of components—yield prediction, crop prediction, data processing, bias assessment, risk analysis—these components are complex in nature, so we can expect a certain level of complexity.

Hopes and Expectations for STELAR
Looking ahead, what are your hopes and expectations for the outcomes of the STELAR project?
Since the STELAR project is focused on agriculture, the foremost priority is to contribute valuable advancements to this domain. It would be nice to have resources that enrich the agriculture domain holistically, addressing the various challenges that the industry faces—from weather and land demographics to temporal aspects and market requirements.
I hope to develop all these together with the STELAR team. We are on a good track and in the process of aligning all these research ideas and components together. I think we will have a comprehensive tool, interface, and system in place by the end of the project, which could be very helpful in understanding the challenges of this particular domain, the growth of crops, land acquisition, distribution processes, and of course, contributing to policy making and decision-making processes as well.
Conclusion
This concludes our conversation with Vivek Kumar and wraps up the third episode of our podcast.
More STELAR episodes are coming soon, so be sure to follow our YouTube channel and our Blog. See you soon!