How retrieval-augmented generation (RAG) can transform drug discovery

BioStrand (a subsidiary of IPA)
Dec 14, 2023

In a recent article on knowledge graphs and large language models (LLMs) in drug discovery, we noted that, despite the transformative potential of LLMs, several critical challenges have to be addressed to ensure that these technologies conform to the rigorous standards demanded by life sciences research.

Combining knowledge graphs with LLMs in one bidirectional data- and knowledge-based reasoning framework addresses several concerns related to hallucinations and lack of interpretability. However, that still leaves the challenge of giving LLMs access to external data sources that compensate for their limitations in factual accuracy and up-to-date knowledge recall.

Retrieval-augmented generation (RAG) joins knowledge graphs and LLMs as the third element in the trifecta of techniques required to integrate language models robustly and reliably into drug discovery pipelines.

Why Retrieval-Augmented Generation (RAG)?

One of the key limitations of general-purpose LLMs is their training data cutoff: a model has seen nothing published after that date, so its responses can fall out of step with rapidly evolving information. This is a serious drawback, especially in fast-paced domains like life sciences research.

Retrieval-augmented generation (RAG) enables biomedical research pipelines to optimize LLM output by:

  1. Grounding the language model on external sources of targeted and up-to-date knowledge, which supplement the LLM's internal representation of information without having to retrain the model. This ensures that responses are based on the most current data and are more contextually relevant.
  2. Providing access to the sources behind the model's responses, so that its claims can be checked for relevance and accuracy.

In short, retrieval-augmented generation provides the framework necessary to augment the recency, accuracy, and interpretability of LLM-generated information.
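The second point, checking claims against sources, can be partly automated. The sketch below is one minimal way to do it, assuming the model has been asked to cite retrieved passages with bracketed ids such as `[doc1]`. The citation format, the function name `check_citations`, and the example answer are all illustrative assumptions, not part of any particular RAG system.

```python
import re


def check_citations(answer, retrieved_ids):
    """Split the ids cited in an answer into supported and unknown ones.

    Assumes citations appear as bracketed ids, e.g. "[doc1]" (hypothetical format).
    """
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    known = set(retrieved_ids)
    return {
        "supported": sorted(cited & known),  # ids that were actually retrieved
        "unknown": sorted(cited - known),    # ids the model may have invented
    }


# Invented placeholder answer, not a real finding.
example_answer = "Compound A binds target X [doc1]. It is well tolerated [doc9]."
report = check_citations(example_answer, ["doc1", "doc3"])
print(report)
```

Any id in the `unknown` list points to a citation that does not correspond to a retrieved passage and deserves manual review. A check like this only confirms that a source exists, not that the source supports the claim.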

How does retrieval-augmented generation work?

Retrieval-augmented generation is a natural language processing (NLP) approach that combines elements of both information retrieval and text generation models to enhance performance on knowledge-intensive tasks.

The retrieval component gathers information relevant to a specific query from a predefined set of documents or knowledge sources, which then serves as the context for the generation model.

Once the information has been retrieved, it is combined with the input context to create an integrated context containing both the original query and the relevant retrieved information.

This integrated context is then fed into a generation model to generate an accurate, coherent, and contextually appropriate response based on both pre-trained knowledge and retrieved query-specific information.
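The three steps described above (retrieve, combine, generate) can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production design: the retriever uses simple bag-of-words cosine similarity where a real system would more likely use learned embeddings, the corpus passages are invented placeholders, and `generate` is a stub standing in for whatever LLM call a pipeline actually makes.

```python
import math
import re
from collections import Counter

# Toy corpus: invented placeholder passages, not real findings.
DOCUMENTS = {
    "doc1": "Kinase inhibitors can block signalling in tumour cells.",
    "doc2": "Antibody binding affinity depends on the epitope structure.",
    "doc3": "Clinical trials measure the safety of kinase inhibitors.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Step 1: return the ids of the k passages most similar to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(text))), doc_id)
        for doc_id, text in documents.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(query, documents, doc_ids):
    """Step 2: combine the query and retrieved passages into one context."""
    sources = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the sources below and cite them by id.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


def generate(prompt):
    """Step 3: placeholder for an LLM call (hypothetical; swap in a real client)."""
    return f"(model output for a prompt of {len(prompt)} characters)"


query = "What do clinical trials say about kinase inhibitors?"
top_ids = retrieve(query, DOCUMENTS)
prompt = build_prompt(query, DOCUMENTS, top_ids)
print(prompt)
print(generate(prompt))
```

Keeping the passage ids in the prompt is a deliberate choice: it lets the generated answer point back to specific sources, which is what makes the output checkable.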

The RAG approach gives life sciences research teams more control over the grounding data used by a biomedical LLM by focusing it on enterprise- and domain-specific knowledge sources. It also enables the integration of a range of external data sources, such as document repositories, databases, or APIs, that are most relevant to improving the model's response to a query.
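One way to picture the integration of several sources is to treat each one as a function that takes a query and returns passages tagged with their origin. The sketch below assumes this simple interface; the source names, the keyword matching, and the records are hypothetical placeholders, and a real deployment would wrap actual repositories, databases, or API clients behind the same shape.

```python
from typing import Callable, Dict, List

Passage = Dict[str, str]                  # {"source": ..., "text": ...}
Source = Callable[[str], List[Passage]]   # a query goes in, passages come out


def make_keyword_source(name: str, records: List[str]) -> Source:
    """Wrap a list of text records as a searchable source (toy keyword match)."""
    def search(query: str) -> List[Passage]:
        terms = set(query.lower().split())
        return [
            {"source": name, "text": text}
            for text in records
            if terms & set(text.lower().split())
        ]
    return search


def retrieve_from_all(query: str, sources: List[Source]) -> List[Passage]:
    """Query every source and merge the results, dropping duplicate texts."""
    seen = set()
    merged = []
    for source in sources:
        for passage in source(query):
            if passage["text"] not in seen:
                seen.add(passage["text"])
                merged.append(passage)
    return merged


# Invented placeholder records standing in for an internal store and a literature index.
internal_notes = make_keyword_source(
    "internal_notes", ["assay results for target_x", "meeting agenda"]
)
literature = make_keyword_source(
    "literature", ["target_x pathway review", "assay results for target_x"]
)

passages = retrieve_from_all("target_x", [internal_notes, literature])
for passage in passages:
    print(passage["source"], "->", passage["text"])
```

Because every passage carries its `source` label, the team controlling the pipeline can decide which sources are allowed to ground a given answer.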

The value of RAG in biomedical research

Conceptually, the retrieve+generate model’s capabilities in terms of dealing with dynamic external information sources, minimizing hallucinations, and enhancing interpretability make it a natural and complementary fit to augment the performance of bioLLMs.

In order to quantify this augmentation in performance, a recent research effort evaluated the ability of a retrieval-augmented generative agent in biomedical question-answering vis-a-vis LLMs (GPT-3.5/4), state-of-the-art commercial tools (Elicit, Scite, and Perplexity) and humans (biomedical researchers).

The RAG agent, PaperQA, was first evaluated against a standard multiple-choice LLM-evaluation dataset, PubMedQA, with the provided context removed to test the agents' ability to retrieve information. In this case, the RAG agent beat GPT-4 by nearly 30 points (86.3% versus 57.9%).

Next, the researchers constructed a more complex and more contemporary dataset (LitQA), based on more recent full-text research papers outside the bounds of the LLMs' pre-training data, to compare the integrated abilities of PaperQA, LLMs, and human researchers to retrieve the right information and to generate an accurate answer based on that information.

Again, the RAG agent outperformed both pre-trained LLMs and commercial tools, with overall accuracy (69.5%) and precision (87.9%) scores that were on par with biomedical researchers. More importantly, the RAG agent produced zero hallucinated citations, compared with a hallucinated-citation rate of 40–60% for the LLMs.
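The gap between the accuracy and precision figures is easier to read with the definitions spelled out. The sketch below assumes that accuracy counts correct answers over all questions, while precision counts correct answers over only the questions the system chose to answer rather than decline. That reading is an assumption here and should be checked against the study itself. The response data is invented purely to show the arithmetic.

```python
def score(responses):
    """Return (accuracy, precision) for a list of question responses.

    Each response is a dict with "answer" (None if the system declined
    to answer) and "truth". Definitions assumed for illustration:
      accuracy  = correct / all questions
      precision = correct / questions actually answered
    """
    total = len(responses)
    answered = [r for r in responses if r["answer"] is not None]
    correct = sum(1 for r in answered if r["answer"] == r["truth"])
    accuracy = correct / total if total else 0.0
    precision = correct / len(answered) if answered else 0.0
    return accuracy, precision


# Invented example: five questions, one declined, three answered correctly.
responses = [
    {"answer": "A", "truth": "A"},
    {"answer": "B", "truth": "B"},
    {"answer": "C", "truth": "D"},
    {"answer": None, "truth": "A"},
    {"answer": "D", "truth": "D"},
]
accuracy, precision = score(responses)
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Under these definitions, a system that declines to answer when its sources are insufficient gives up some accuracy but keeps precision high, which is often the preferable trade-off in research settings.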

Despite being just a narrow evaluation of the performance of the retrieval+generation approach in biomedical QA, the above research does demonstrate the significantly enhanced value that RAG+BioLLM can deliver compared to purely generative AI.

The combined sophistication of retrieval and generation models can be harnessed to enhance the accuracy and efficiency of a range of processes across the drug discovery and development pipeline.

Retrieval-augmented generation in drug discovery

In the context of drug discovery, RAG can be applied to a range of tasks, from literature reviews to biomolecule design.

Currently, generative models have demonstrated potential for de novo molecular design but are still hampered by their inability to integrate multimodal information or provide interpretability. The RAG framework can facilitate the retrieval of multimodal information from a range of sources, such as chemical databases, biological data, clinical trials, and images, that can significantly augment generative molecular design.

The same expanded retrieval + augmented generation template applies to a whole range of applications in drug discovery, for example: compound design (retrieve compounds and their properties, and generate improvements or new properties); drug-target interaction prediction (retrieve known drug-target interactions and generate potential interactions between new compounds and specific targets); and adverse effects prediction (retrieve known adverse effects and generate modifications to eliminate them).
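As a rough illustration of the template for drug-target interaction prediction, the sketch below retrieves known interactions for compounds that resemble a new one and assembles them into a prompt for a generative model. Everything in it is an assumption made for the example: compounds are reduced to sets of hypothetical feature labels, similarity is a plain Jaccard overlap between those sets, and the compound and target names are placeholders rather than real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    compound: str
    target: str
    features: frozenset  # hypothetical structural feature labels


# Invented placeholder records, not real interaction data.
KNOWN = [
    Interaction("compound_A", "target_X", frozenset({"f1", "f2", "f3"})),
    Interaction("compound_B", "target_Y", frozenset({"f3", "f4"})),
    Interaction("compound_C", "target_X", frozenset({"f5"})),
]


def jaccard(a, b):
    """Overlap between two feature sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_similar(query_features, known, threshold=0.3):
    """Retrieve known interactions whose compound resembles the query."""
    hits = [(jaccard(query_features, k.features), k) for k in known]
    hits = [(score, k) for score, k in hits if score >= threshold]
    hits.sort(key=lambda hit: (-hit[0], hit[1].compound))
    return hits


def build_prompt(new_compound, hits):
    """Combine retrieved interactions into context for a generative model."""
    lines = [
        f"- {k.compound} interacts with {k.target} (similarity {score:.2f})"
        for score, k in hits
    ]
    return (
        f"Known interactions for compounds similar to {new_compound}:\n"
        + "\n".join(lines)
        + f"\n\nSuggest plausible targets for {new_compound} and cite the evidence."
    )


query_features = frozenset({"f1", "f2", "f4"})
hits = retrieve_similar(query_features, KNOWN)
print(build_prompt("compound_new", hits))
```

The same shape carries over to the other applications listed above: only the records being retrieved and the instruction given to the model change.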

The template even applies to several sub-processes/-tasks within drug discovery to leverage a broader swathe of existing knowledge to generate novel, reliable, and actionable insights. In target validation, for example, retrieval-augmented generation can enable the comprehensive generative analysis of a target of interest based on an extensive review of all existing knowledge about the target, expression patterns and functional roles of the target, known binding sites, pertinent biological pathways and networks, potential biomarkers, etc.

In short, the more efficient and scalable retrieval of timely information ensures that generative models are grounded in factual, sourceable knowledge, a combination with limitless potential to transform drug discovery.

An integrated approach to retrieval-augmented generation

Retrieval-augmented generation addresses several of the critical limitations and augments the generative capabilities of bioLLMs. However, there are additional design rules and multiple technological profiles that have to come together to successfully address the specific requirements and challenges of life sciences research.

Our LENSᵃⁱ Integrated Intelligence Platform seamlessly unifies the semantic proficiency of knowledge graphs, the versatile information retrieval capabilities of retrieval-augmented generation, and the reasoning capabilities of large language models to reinvent the Understanding-Retrieve-Generate cycle in biomedical research.

Our unified approach empowers researchers to query a harmonized life science knowledge layer that integrates unstructured information & ontologies into a knowledge graph.

A semantic-first approach enables a more accurate understanding of research queries, which in turn results in the retrieval of content that is most pertinent to the query. The platform also integrates retrieval-augmented generation with structured biomedical data from our HYFT technology to enhance the accuracy of generated responses.

And finally, LENSᵃⁱ combines deep learning LLMs with neuro-symbolic logic techniques to deliver comprehensive and interpretable outcomes for inquiries.

To experience this unified solution in action, please contact us here.

Originally published on December 14, 2023.