Mitigating LLM Hallucinations

BioStrand (a subsidiary of IPA)
Jan 9, 2024


However, several challenges must still be addressed before LLMs can be reliably integrated into in-silico drug discovery pipelines and workflows. One of these is hallucination.

Why do LLMs hallucinate?

At a time of some speculation about laziness and seasonal depression in LLMs, a hallucination leaderboard of 11 public LLMs reported hallucination rates ranging from 3% for the best-performing model to 27% for the worst. Another comparative study, which looked at two versions of a popular LLM generating ophthalmic scientific abstracts, found very high rates of fabricated references (33% and 29%).

This tendency of LLMs to hallucinate, that is, to present incorrect or unverifiable information as accurate, can have serious consequences in critical drug discovery applications, even at a rate of 3%.
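To make the point about a 3% rate concrete, here is a minimal arithmetic sketch. It assumes, purely for illustration, that each response hallucinates independently at the same fixed rate, which real workloads will not satisfy exactly; the function name is hypothetical.

```python
def prob_at_least_one_hallucination(rate: float, n_responses: int) -> float:
    """Chance that at least one of n responses contains a hallucination.

    Assumption (illustrative only): responses are independent and share
    the same per-response hallucination rate.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if n_responses < 0:
        raise ValueError("n_responses must be non-negative")
    # Complement of "every single response is clean".
    return 1.0 - (1.0 - rate) ** n_responses


for n in (1, 10, 100):
    p = prob_at_least_one_hallucination(0.03, n)
    print(f"{n:>3} responses at a 3% rate: {p:.1%} chance of at least one hallucination")
```

Under these assumptions, the chance of at least one hallucinated response is roughly 26% across 10 responses and roughly 95% across 100, which is why a seemingly low per-response rate still matters in a pipeline that issues many queries.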

There are several reasons for LLM hallucinations.

At the core of this behavior is the fact that generative AI models have no actual intelligence, relying instead on a probability-based approach to predict the data that is most likely to occur based on patterns and contexts ‘learned’ from their training data. Apart from this inherent lack of contextual understanding, other potential causes include exposure to noise, errors, biases, and inconsistencies in training data, the training and generation methods used, and even prompting techniques.
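The probability-based behavior described above can be illustrated with a toy sketch. The candidate continuations and their scores below are invented for illustration and do not come from any real model; the point is only that selection is driven by likelihood, with no step that checks whether the chosen continuation is true.

```python
import math


def softmax(logits: dict) -> dict:
    """Convert raw scores into a probability distribution over candidates."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Hypothetical scores for continuing "The reference supporting this claim is ...".
# The labels are placeholders, not real citations.
toy_logits = {
    "plausible_citation_A": 2.1,
    "plausible_citation_B": 1.9,
    "no_source_known": 0.3,
}

probs = softmax(toy_logits)
most_likely = max(probs, key=probs.get)

for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
    print(f"{token:<22} {p:.2f}")
print("Greedy choice:", most_likely)
```

In this toy setup the highest-scoring candidate is a plausible-looking citation, and nothing in the selection step verifies that such a citation exists.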

How to mitigate LLM hallucinations?

There are three broad and complementary approaches to mitigating hallucinations in large language models: prompt engineering, fine-tuning, and grounding combined with prompt augmentation.

Prompt Engineering

Prompt engineering is the process of strategically designing user inputs, or prompts, in order to guide model behavior and obtain optimal responses. There are three major approaches to prompt engineering: zero-shot, few-shot, and chain-of-thought prompts. In zero-shot prompting, the model is given a task or question without any worked examples and is expected to produce a reliable result based on what it learned during training. Few-shot prompting involves providing a small number of examples to the LLM before presenting the actual query. Chain-of-thought (CoT) prompting is based on the finding that a series of intermediate reasoning steps provided as examples during prompting can significantly improve the reasoning capabilities of large language models. The chain-of-thought concept has been expanded to include new techniques such as Chain-of-Verification (CoVe), a self-verification process that enables LLMs to check the accuracy and reliability of their output, and Chain of Density (CoD), a process that focuses on summarization rather than reasoning to control the density of information in the generated text.
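The three prompting styles described above differ mainly in what is placed in front of the actual query. The sketch below builds each kind of prompt as a plain string; the function names, the "Q:/A:" layout, and the example questions are illustrative assumptions rather than a prescribed format, and no model is called.

```python
def zero_shot(question: str) -> str:
    """Zero-shot: the query alone, with no worked examples."""
    return f"Q: {question}\nA:"


def few_shot(examples, question: str) -> str:
    """Few-shot: (question, answer) pairs shown before the query."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"


def chain_of_thought(examples, question: str) -> str:
    """CoT: each example also spells out its intermediate reasoning."""
    shots = "\n\n".join(
        f"Q: {q}\nA: {reasoning} So the answer is {a}."
        for q, reasoning, a in examples
    )
    return f"{shots}\n\nQ: {question}\nA:"


query = "How many codons are in a 12-base coding sequence?"

print(zero_shot(query))
print("---")
print(few_shot([("Which base pairs with adenine in DNA?", "Thymine")], query))
print("---")
print(chain_of_thought(
    [("How many codons are in a 9-base coding sequence?",
      "Each codon is 3 bases long, and 9 / 3 = 3.",
      "3")],
    query,
))
```

The resulting strings would then be sent to whichever model is in use; only the prompt construction is shown here.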

Fine-Tuning

Where the focus of prompt engineering is on the inputs required to elicit better LLM output, fine-tuning emphasizes task-specific training in order to enhance the performance of pre-trained models in specific topics or domain areas. A conventional approach to LLM fine-tuning is full fine-tuning, which involves additional training of a pre-trained model on labeled, domain- or task-specific data in order to generate more contextually relevant responses. This is a time-, resource-, and expertise-intensive process. An alternative approach is parameter-efficient fine-tuning (PEFT), which trains only a small set of extra parameters while leaving most of the pre-trained model unchanged. The modular nature of PEFT means that training can prioritize select portions or components of the original parameters so that the same pre-trained model can be adapted for multiple tasks. LoRA (Low-Rank Adaptation of Large Language Models), a popular PEFT technique, can significantly reduce the resource intensity of fine-tuning while, in many reported cases, matching the performance of full fine-tuning.
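As a rough illustration of why LoRA is cheaper than full fine-tuning, the sketch below counts trainable parameters for a single weight matrix and applies a low-rank update of the form W + (alpha / r) * B A on a tiny example. The layer size, rank, and helper names are illustrative assumptions; real workflows would rely on a deep learning framework rather than plain Python lists.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Full fine-tuning updates every entry of the d_out x d_in matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains only B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merged_weight(w, b, a, alpha: float, rank: int):
    """Return W + (alpha / rank) * B @ A, leaving the frozen W untouched."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Parameter counts for one hypothetical 4096 x 4096 layer at rank 8.
full = full_finetune_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")

# Tiny worked example with rank 1.
w = [[1.0, 2.0], [3.0, 4.0]]   # frozen pre-trained weights
a = [[0.5, -0.5]]              # rank x d_in
b_init = [[0.0], [0.0]]        # B starts at zero, so the model is unchanged
b_trained = [[1.0], [2.0]]     # made-up values standing in for a trained B

print(lora_merged_weight(w, b_init, a, alpha=1.0, rank=1))
print(lora_merged_weight(w, b_trained, a, alpha=1.0, rank=1))
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one, and separate (A, B) pairs can be kept per task while sharing the same frozen base weights.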

Grounding & Augmentation

Grounding ensures that LLMs have access to up-to-date and use-case-specific information sources to provide the relevant context that may not be available solely from the training data. Similarly, prompt augmentation enhances a prompt with contextually relevant information that enables LLMs to generate a more accurate and pertinent output.

Factual grounding is a technique typically used in the pre-training phase to ensure that LLM output across a variety of tasks is consistent with a knowledge base of factual statements. Post-training grounding relies on a range of external knowledge bases, including documents, code repositories, and public and proprietary databases, to improve the accuracy and relevance of LLMs on specific tasks.
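A minimal sketch of post-training grounding with prompt augmentation follows. It assumes a tiny in-memory knowledge base and a naive word-overlap retriever, both of which are stand-ins chosen for illustration; production systems typically use embedding-based search over much larger sources, and the function names and prompt wording here are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents, top_k: int = 2):
    """Rank documents by word overlap with the query and keep the best matches."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    # Drop documents that share no words with the query at all.
    return [doc for doc in ranked[:top_k] if query_words & tokenize(doc)]


def augment_prompt(query: str, passages) -> str:
    """Prepend retrieved passages so the model can answer from supplied context."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Toy knowledge base standing in for documents or databases.
knowledge_base = [
    "Insulin is a peptide hormone produced by beta cells of the pancreas.",
    "Hemoglobin transports oxygen in red blood cells.",
    "LoRA is a parameter-efficient fine-tuning technique.",
]

question = "Which cells produce insulin?"
passages = retrieve(question, knowledge_base, top_k=2)
print(augment_prompt(question, passages))
```

The augmented prompt, rather than the bare question, is what would be passed to the model, and the instruction to admit insufficient context gives the model an alternative to inventing an answer.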

Also read: How retrieval-augmented generation (RAG) can transform drug discovery

Integrated Intelligence with LENSᵃⁱ

Holistic life sciences research requires the sophisticated orchestration of several innovative technologies and frameworks. LENSᵃⁱ Integrated Intelligence, our next-generation data-centric AI platform, fluently blends some of the most advanced proprietary technologies into one seamless solution that empowers end-to-end drug discovery and development.

LENSᵃⁱ integrates RAG-enhanced bioLLMs with an ontology-driven NLP framework, combining neuro-symbolic logic techniques to connect and correlate syntax (multi-modal sequential and structural data) and semantics (biological functions). A comprehensive and continuously expanding knowledge graph, mapping a remarkable 25 billion relationships across 660 million data objects, links sequence, structure, function, and literature information from the entire biosphere to provide a comprehensive overview of the relationships between genes, proteins, structures, and biological pathways. Our next-generation, unified, knowledge-driven approach to the integration, exploration, and analysis of heterogeneous biomedical data empowers life sciences researchers with the high-tech capabilities needed to explore novel opportunities in drug discovery and development.

Originally published on January 9, 2024.


