Integrating knowledge graphs and large language models for next-generation drug discovery

BioStrand (a subsidiary of IPA)
Oct 27, 2023

Across several previous blogs, we have explored the importance of Knowledge Graphs, Large Language Models (LLMs), and semantic analysis in biomedical research. Today, we focus on integrating these distinct concepts into a unified model that can help advance drug discovery and development.

But before we get to that, here’s a quick synopsis of the knowledge graph, LLM & semantic analysis narrative so far.

LLMs, Knowledge Graphs & Semantics in Biomedical Research

It has been established that biomedical LLMs — domain-specific models pre-trained exclusively on domain-specific vocabulary — outperform conventional tools in many biological data-based tasks. It is therefore considered inevitable that these models will quickly expand across the broader biomedical domain.

However, several challenges, such as hallucinations and limited interpretability, have to be addressed before biomedical LLMs can be taken mainstream. A key challenge specific to the biomedical domain is LLMs’ lack of semantic intelligence.

LLMs have, debatably, been described as ‘stochastic parrots’ that comprehend none of the language they process, relying instead on ‘learning’ meaning based on the large-scale extraction of statistical correlations. This has led to the question of whether modern LLMs really possess any inductive, deductive, or abductive reasoning abilities.

Biomedical Knowledge Graphs address this key capability gap in LLMs by going beyond statistical correlations to bring the power of context to biomedical language models. Knowledge graphs help capture the inherent graph structure of biomedical data, such as drug-disease interactions and protein-protein interactions, and model complex relationships between disparate data elements into one unified structure that is both human-readable and computationally accessible.
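To make the idea of a "unified structure that is both human-readable and computationally accessible" concrete, here is a minimal sketch that stores relationships as (subject, predicate, object) triples and walks them to find a chain linking a drug to a disease. The entity names (`drug_A`, `protein_X`, and so on) and the helper functions are hypothetical placeholders for illustration, not real biomedical data or any particular platform's API.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not real biomedical data.
TRIPLES = [
    ("drug_A", "inhibits", "protein_X"),
    ("protein_X", "interacts_with", "protein_Y"),
    ("protein_Y", "associated_with", "disease_D"),
    ("drug_B", "treats", "disease_D"),
]


def build_index(triples):
    """Index outgoing edges by subject so neighbors can be looked up directly."""
    index = defaultdict(list)
    for subj, pred, obj in triples:
        index[subj].append((pred, obj))
    return index


def find_paths(index, start, goal, max_hops=3):
    """Return all directed paths from start to goal using at most max_hops edges."""
    paths = []
    stack = [(start, [])]  # each entry: (current node, triples walked so far)
    while stack:
        node, path = stack.pop()
        if node == goal and path:
            paths.append(path)
            continue
        if len(path) >= max_hops:
            continue  # hop limit also guards against cycles
        for pred, obj in index.get(node, []):
            stack.append((obj, path + [(node, pred, obj)]))
    return paths


index = build_index(TRIPLES)
paths = find_paths(index, "drug_A", "disease_D")
for path in paths:
    print(" -> ".join(f"{s} [{p}] {o}" for s, p, o in path))
```

Each step in a returned path is an explicit, named relationship, which is what makes a graph-derived answer easier to inspect than a purely statistical association.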

Semantic Knowledge Graphs and LLMs in Drug Discovery

Even as research continues to focus on the symbiotic possibilities of a unified knowledge graph-LLM framework, these concepts are already having a transformative impact on several drug discovery and development processes.

Take target identification, for instance, a critical step in drug discovery with consequential implications for downstream development processes. AI-powered language models have been shown to outperform state-of-the-art approaches in key tasks such as biomedical named entity recognition (BioNER) and biomedical relation extraction. Transformer-based LLMs are being used in chemoinformatics to advance drug-target relationship prediction and to effectively generate novel, valid, and unique molecules. LLMs are also evolving beyond basic text-to-text frameworks to multi-modal large language models (MLLMs) that bring the combined power of image plus text adaptive learning to target identification and validation. Meanwhile, the semantic capabilities of knowledge graphs make target identification more efficient by enabling the harmonization and enrichment of heterogeneous data into one connected framework for more holistic exploration and analysis.

AI-enabled LLMs are increasingly being used across drug discovery and development to predict drug-target interactions (DTIs) and drug-drug interactions, molecular properties such as pharmacodynamics, pharmacokinetics, and toxicity, and even likely drug withdrawals from the market due to safety concerns. In the drug discovery domain, biomedical knowledge graphs are being used across a range of tasks, including polypharmacy prediction, DTI prediction, adverse drug reaction (ADR) prediction, gene-disease prioritization, and drug repurposing.

The next significant point of inflection will be the integration of these powerful technologies into one synergized model to drive a stepped increase in performance and efficiency.

Optimizing LLMs for Biomedical Research

There are three key challenges — knowledge cut-off, hallucinations, and interpretability — that must be addressed before LLMs can be reliably integrated into biomedical research. There are currently two complementary approaches to mitigate these challenges and optimize biomedical LLM performance.

The first approach is to leverage the structured, factual, domain-specific knowledge contained in biomedical knowledge graphs to enhance the factual accuracy, consistency, and transparency of LLMs. Using graph-based query languages, the pre-structured data embedded in knowledge graph frameworks can be directly queried and integrated into LLMs.
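As a rough illustration of querying structured facts and handing them to a language model, the sketch below matches triples against a simple pattern and formats the results into a prompt. It is only a stand-in: a production system would typically issue the query in a graph query language against a graph database, and the facts, the `match` and `build_prompt` helpers, and the prompt wording are all assumptions made for this example.

```python
# Hypothetical toy facts; a real system would query a graph database instead.
FACTS = [
    ("drug_A", "inhibits", "protein_X"),
    ("drug_A", "has_side_effect", "effect_E"),
    ("drug_B", "inhibits", "protein_Y"),
]


def match(facts, subj=None, pred=None, obj=None):
    """Return facts matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in facts
        if subj in (None, s) and pred in (None, p) and obj in (None, o)
    ]


def build_prompt(question, facts):
    """Prepend the retrieved facts to the question so the answer can be grounded in them."""
    lines = [f"- {s} {p.replace('_', ' ')} {o}" for s, p, o in facts]
    context = "\n".join(lines) if lines else "- (no matching facts)"
    return (
        "Answer using only the facts below and say so if they are insufficient.\n"
        f"Facts:\n{context}\n"
        f"Question: {question}"
    )


facts = match(FACTS, subj="drug_A")
prompt = build_prompt("What does drug_A inhibit?", facts)
print(prompt)  # this string would be sent to whichever LLM is in use
```

Because the facts in the prompt come straight from the graph, each statement in the model's answer can in principle be traced back to a specific triple, which is where the gains in consistency and transparency come from.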

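The second approach is retrieval-augmented generation (RAG), in which the model retrieves relevant information from external sources at query time and conditions its response on that material rather than relying only on what it learned during training.

Combining the knowledge graph- and RAG-based approaches can lead to significant improvements in LLM performance in terms of factual accuracy, context-awareness, and continuous knowledge enrichment.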
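One way to picture the combination is a context builder that pulls free-text passages by similarity (the RAG side) and structured triples by entity match (the knowledge graph side), then merges both into a single context for the model. The sketch below uses a deliberately simple bag-of-words cosine similarity in place of learned embeddings, and the passages, triples, and function names are hypothetical; it is not a description of any specific platform's implementation.

```python
import math
from collections import Counter

# Hypothetical passages standing in for an indexed literature corpus.
PASSAGES = [
    "drug_A reduced tumor growth in a mouse model",
    "protein_X is overexpressed in disease_D tissue",
    "a survey of sequencing instruments and costs",
]
# Hypothetical knowledge graph triples.
TRIPLES = [
    ("drug_A", "inhibits", "protein_X"),
    ("drug_B", "inhibits", "protein_Y"),
]


def vectorize(text):
    """Bag-of-words term counts over lowercased, whitespace-split tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = vectorize(query)
    return sorted(passages, key=lambda p: cosine(q, vectorize(p)), reverse=True)[:k]


def graph_facts(query, triples):
    """Return triples whose subject or object is mentioned in the query."""
    terms = set(query.lower().split())
    return [t for t in triples if t[0].lower() in terms or t[2].lower() in terms]


def build_context(query, passages, triples, k=2):
    """Merge graph facts and retrieved passages into one context block."""
    fact_lines = [f"[KG] {s} {p} {o}" for s, p, o in graph_facts(query, triples)]
    passage_lines = [f"[DOC] {p}" for p in retrieve(query, passages, k)]
    return "\n".join(fact_lines + passage_lines)


query = "does drug_A act on protein_X in disease_D"
context = build_context(query, PASSAGES, TRIPLES)
print(context)
```

Tagging each line with its origin (`[KG]` or `[DOC]`) keeps curated facts distinguishable from retrieved text, and adding new passages or triples updates what the model sees without retraining it. Note that the whitespace tokenizer here does not strip punctuation, so a real pipeline would need proper tokenization and entity linking.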

LENSᵃⁱ: The Next-Generation RAG-KG-LLM Platform

At BioStrand, we have built a next-generation unified knowledge graph-large language model framework for holistic life sciences research. At the core of our LENSᵃⁱ platform is a comprehensive and continuously expanding knowledge graph that maps 25 billion relationships across 660 million data objects, linking sequence, structure, function, and literature information from the entire biosphere. Our first-in-class technology provides a holistic understanding of the relationships between genes, proteins, and biological pathways, thereby opening up powerful new opportunities for drug discovery and development. The platform leverages the latest advances in ontology-driven NLP and AI-driven LLMs to connect and correlate syntax (multi-modal sequential and structural data) and semantics (functions). Our unified approach to biomedical knowledge graphs, retrieval-augmented generation models, and large language models combines the reasoning capabilities of LLMs, the semantic proficiency of knowledge graphs, and the versatile information retrieval capabilities of RAG to streamline the integration, exploration, and analysis of all biomedical data.

Originally published on October 27, 2023.


