Description
Large language models (LLMs) have shown promising capabilities in several domains, but “Inherent Bias,” “Data Privacy & Confidentiality,” “Hallucinations,” “Stochastic Parrot,” and “Inadequate Evaluations” limit the reliability of LLMs for direct and unsupervised use. These challenges are exacerbated in complex, sensitive, low-resource domains where large-scale, high-quality datasets are scarce. This work therefore introduces in-context synthetic data generation with LLMs as a technique to mitigate both bias in machine learning (ML) pipelines and data scarcity, through two real-world use cases in the healthcare domain (mental health) and the Agri-Food space (food hazard identification). In this context, this work presents expert-annotated Food-Hazard and Conversational Therapy (MI dialogues) datasets developed through a joint LLM and human-in-the-loop effort. The data generation process relies on carefully engineered prompts with cues and tailored information to ensure high-quality dialogue generation, accounting for contextual relevance and guarding against false semantic change. Both datasets are comprehensively evaluated through binary and multiclass classification tasks that probe the ability of LLMs to reason about and understand domain intricacies. Our empirical findings demonstrate that the data generated through this rigorous quality-control process is both plausible and substantially beneficial in enabling ML techniques to address the targeted biases, thereby supporting the use of LLMs for supervised, task-specific applications in sensitive domains such as mental health. Our contributions not only provide the MI community with comprehensive datasets but also offer valuable insights for using LLMs in plausible text generation across multiple low-resource domains.
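
The abstract does not give implementation details, but the in-context generation step it describes can be pictured with a minimal sketch. The OpenAI Python client, the model name, and the prompt cues below are all illustrative assumptions, not the authors' actual setup:

```python
# Minimal sketch of in-context synthetic dialogue generation.
# Assumptions: the OpenAI Python client as a stand-in LLM backend;
# the cue-laden prompt below is illustrative, not the paper's prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_mi_dialogue(topic: str, client_profile: str, n_turns: int = 8) -> str:
    """Generate one synthetic MI-style counseling dialogue."""
    prompt = (
        "You are simulating a motivational-interviewing counseling session.\n"
        f"Topic: {topic}\n"
        f"Client profile: {client_profile}\n"
        f"Write {n_turns} alternating Therapist/Client turns.\n"
        "Cues: use open questions, reflections, and affirmations; "
        "stay contextually relevant and do not alter the client's stated facts."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content


print(generate_mi_dialogue("smoking cessation", "adult considering quitting"))
```

In the paper's pipeline, outputs like these are then filtered and annotated by human experts before entering the dataset.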
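
Likewise, the binary and multiclass classification evaluations could be reproduced in spirit with a standard text-classification pipeline. The TF-IDF features, logistic-regression classifier, and toy examples below are placeholder choices for illustration only, not the authors' models or data:

```python
# Sketch of the binary/multiclass evaluation described in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy placeholder data; the real corpora are the expert-annotated
# Food-Hazard and MI-dialogue datasets described above.
texts = [
    "batch recalled after salmonella was detected in the product",
    "metal fragments reported in packaged cereal",
    "undeclared peanut allergen found during the label audit",
    "listeria contamination traced to the packing plant",
    "routine inspection passed with no findings",
    "label redesign announced for the spring line",
    "new packaging supplier contract signed",
    "quarterly sales figures released for the region",
]
labels = ["hazard"] * 4 + ["no-hazard"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Binary here; the same pipeline extends to multiclass label sets.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```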