A hybrid approach to NLP in drug discovery

BioStrand (a subsidiary of IPA)
Aug 16, 2022

Artificial intelligence-powered technologies such as natural language processing (NLP) are becoming critical to the pharmaceutical and life sciences industries, which are overwhelmed by volumes of data, almost 80 percent of which exists as inaccessible and unusable unstructured text. The availability of domain-driven, easy-to-use NLP technologies plays a central role in enabling businesses to mobilise unstructured data at scale and to embrace a truly data-driven approach to insight generation and innovation.

NLP solutions are now being used at all stages of drug discovery, from analyzing clinical trial digital pathology data to identifying predictive biomarkers. These technologies have been proven to significantly reduce cost and cycle times, enhance the scope and accuracy of analysis and provide new insights that accelerate the development of new drugs.

However, NLP in drug discovery is not a monolithic concept. There are several possible approaches, each of which may be particularly suited for specific applications. Moreover, any comprehensive solution for integrated enterprise-wide analysis will likely require a blended or hybrid NLP approach.

So here’s a quick dive into some of the key approaches to NLP in drug discovery.

Key NLP approaches

Rules-based NLP

ML-based NLP

Based on their approach to learning, ML-based methods can be further classified under supervised, unsupervised and self-supervised NLP.



Unsupervised NLP is a more advanced and computationally complex approach to analyzing, clustering and discovering patterns in unlabeled data without the need for any manual intervention. It enables the extraction of value from the large body of text that is unlabeled and can be especially important for common NLP tasks like PoS tagging or syntactic parsing. However, unsupervised NLP methods cannot be used for tasks like classification without substantial retraining with annotated data.


Self-supervised learning is still a relatively new concept that has had a significant impact on NLP. In this technique, part of an input dataset is concealed, and self-supervised learning algorithms then analyse the visible part to create the rules that will enable them to predict the hidden data. This process, also known as predictive or pretext learning, auto-generates the labels required for the system to learn, thereby converting an unsupervised problem into a supervised problem. A key distinction between unsupervised and self-supervised learning is that in the former the focus is on the model rather than on the data, while in the latter it is the other way around.
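The concealment step can be illustrated with a minimal sketch. It assumes token-level masking with a placeholder token; the `MASK` string, the `make_masked_examples` function and the example sentence are hypothetical names chosen for illustration, not part of any specific library or platform.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for illustration


def make_masked_examples(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and return (visible_input, labels).

    labels maps each hidden position to the original token, so the
    training targets come from the text itself, not from annotators.
    """
    rng = random.Random(seed)
    visible = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            visible[i] = MASK
            labels[i] = tok
    return visible, labels


# Illustrative sentence, not taken from any real dataset.
sentence = "the kinase inhibitor reduced tumour growth in treated mice".split()
masked, targets = make_masked_examples(sentence, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

A model trained on such pairs sees only `masked` and is scored on how well it recovers `targets`, which is how an unlabeled corpus ends up supplying its own supervised signal.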

In recent times, ML-based approaches have evolved into the NLP deep learning age, driven by the explosion in digital text, increased processing power in the form of GPUs and TPUs, and improved activation functions for neural networks. As a result, deep learning (DL) has become the dominant approach for a variety of NLP tasks. Today, there is a lot of focus on developing DL techniques for NLP tasks that are best expressed with a graph structure. One of the biggest breakthroughs in NLP in recent times has been the transformer, a deep learning model that leverages attention mechanisms to reinvent textual analytics. DL may not be the most efficient or effective solution for simple NLP tasks, but it has produced some groundbreaking results in named entity recognition, document classification and sentiment analysis.
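For readers who want a feel for what an attention mechanism computes, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplified, single-head illustration with no learned weights; the function names and the toy vectors are assumptions made for the example.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy vectors, invented for illustration.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0]]
print(attention(queries, keys, values))
```

Because the query points in the same direction as the first key, most of the weight lands on the first value vector. This is the sense in which each token "attends" to the most relevant parts of its context.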

Hybrid NLP

With hybrid NLP, the focus is on combining the best of rule- and ML-based approaches without having to compromise between the advantages and drawbacks of each. A hybrid system could combine a machine-learning root classifier with a rules-based system, with rules added to the latter for tags that have been incorrectly modelled by the former. Techniques like self-supervised learning can help reduce the human effort required for building models, effort which in turn can be channelled into creating more scalable and accurate solutions. Combining top-down, symbolic, structured knowledge-based approaches with bottom-up, data-driven neural models will enable organizations to optimize resource usage, increase the flexibility of their models and accelerate time to insight.
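A minimal sketch of that pattern might look like the following. The `ml_classifier` function is only a toy stand-in for a trained model, and the rules, labels and example sentences are all hypothetical; the point is simply that hand-written rules take precedence for the tags the model is known to get wrong.

```python
import re


def ml_classifier(text):
    """Toy stand-in for a trained model: returns (label, confidence).

    A real system would call a trained classifier here; the keyword
    count and the confidence values below are purely illustrative.
    """
    hits = sum(word in text.lower() for word in ("inhibitor", "dose", "compound"))
    return ("drug", 0.9) if hits >= 2 else ("other", 0.55)


# Hypothetical hand-written rules for tags the model handles poorly.
RULES = [
    (re.compile(r"\bBRCA[12]\b"), "gene"),
    (re.compile(r"\bTP53\b"), "gene"),
]


def hybrid_classify(text):
    """Apply the rules first, then fall back to the ML classifier."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label, "rule"
    label, _confidence = ml_classifier(text)
    return label, "model"


print(hybrid_classify("BRCA1 variant response to the compound at a low dose"))
print(hybrid_classify("The compound was given at a low dose"))
```

Returning the source of each decision ("rule" or "model") alongside the label keeps the system auditable, which is one of the practical attractions of a hybrid design.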

Hybrid NLP with BioStrand Lensai

The BioStrand Lensai Platform is not designed around a singular technique as a monolithic solution. Instead, it is a careful amalgamation of different knowledge-based and neural models integrated seamlessly within a single pipeline. The core design philosophy is to provide researchers with access to the best NLP components, techniques and models that are most relevant to the objectives and outcomes of their project.

For instance, we took a rules-based approach to semantic parsing. The semantic rules that have been encoded into the engine are based on linguistics and have been refined over a period of more than 10 years. This kind of rules-based algorithm behaves much like a conventional algorithm: it requires no training and is easily understood by human beings. For gene enrichment, our technology utilizes pure standard statistical methods. The query-based graph extractors simply transform the data stored in relational tables into a graph format.
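To give a rough idea of what turning relational rows into a graph can involve, here is a generic sketch. It is not the Lensai implementation; the row layout (subject, relation, object), the `rows_to_graph` function and the sample rows are assumptions made purely for illustration.

```python
# Hypothetical rows as they might come back from a relational query:
# (subject, relation, object). The values are illustrative only.
rows = [
    ("TP53", "interacts_with", "MDM2"),
    ("TP53", "associated_with", "Li-Fraumeni syndrome"),
    ("MDM2", "targeted_by", "nutlin-3"),
]


def rows_to_graph(rows):
    """Build an adjacency list: node -> list of (relation, neighbour)."""
    graph = {}
    for subject, relation, obj in rows:
        graph.setdefault(subject, []).append((relation, obj))
        graph.setdefault(obj, [])  # make sure leaf nodes also appear
    return graph


graph = rows_to_graph(rows)
for node, edges in graph.items():
    print(node, "->", edges)
```

No model is trained at any point here: the transformation is deterministic, which is why this kind of component sits comfortably alongside the rules-based and statistical parts of a pipeline.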

Lensai, therefore, is not a technology- or technique-based approach to biomedical NLP. It is an outcome-based hybrid NLP model designed to maximize analytical productivity by automatically mapping the best NLP technologies and techniques to the task and objectives at hand.


Originally published at https://blog.biostrand.ai.


