Transforming drug design: Vector search in text analysis

June 13, 2024

AI-Driven Rational Drug Design is central to BioStrand’s mission to power the intersection of biotech discovery, biotherapeutics and AI. ‘AI-driven’ signifies the application of artificial intelligence (AI), including machine learning (ML) and natural language processing (NLP). ‘Rational’ alludes to the process of designing drugs based on the understanding of biological targets. This approach leverages computational models and algorithms to predict how drug molecules interact with their target biological molecules, such as proteins or enzymes, involved in disease processes. The goal is to create more effective and safer drugs by precisely targeting specific mechanisms within the body.

Integration of Complex Biological Data

The LENSᵃⁱ™ Integrated Intelligence Platform powered by patented HYFT technology is unique in its data integration of structured and unstructured data, including genomic sequences, protein structures, scientific literature and clinical notes, facilitating a comprehensive understanding of biological systems.

Advanced Computational Techniques

BioStrand’s approach to drug discovery combines AI for rapid compound screening and predictive modeling with text analysis to retrieve information from research articles. This helps to identify promising drug candidates and optimize their properties for better efficacy and safety, which significantly reduces R&D timelines.

The use of AI, ML, and NLP technologies within the LENSᵃⁱ platform, together with embeddings from different protein large language models (LLMs), facilitates the discovery of novel drug targets. These technologies allow for the identification of patterns, relationships, and insights within large datasets.

The combination of BioStrand’s technologies with the InterSystems IRIS data platform introduces a powerful vector search mechanism that facilitates semantic analysis. This approach transforms the search for relevant biological and chemical information by enabling searches based on conceptual similarity rather than just keywords. As a result, researchers can uncover deeper insights into disease mechanisms and potential therapeutic targets.

We wrote about vector search in an earlier blog post. Here we illustrate vector search for text. In the next post, we will dive into the application of vector search for protein analytics.

Utilizing Vector Search in Text Analysis

The primary challenge for text search is locating specific and accurate information within a vast amount of unstructured data. It is like finding a needle in a haystack: a simple keyword search in PubMed can yield thousands of results. While generative models can provide concise answers to questions within seconds, their accuracy is not guaranteed. We implemented Retrieval-Augmented Generation (RAG) to combat the hallucinations that generative chat systems may produce. Moreover, RAG systems deliver up-to-date results and can refer to their sources, which makes their responses traceable. However, like all generative systems, they struggle to handle large input prompts at once. This is where vector search becomes essential: it guides you to the precise area within your data haystack.
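To make the prompt-size constraint concrete, here is a minimal sketch of how passages returned by a vector search might be packed into a RAG prompt under a size budget while keeping source identifiers for traceability. The function name `build_prompt`, the character budget, the instruction wording and the passage ids are all assumptions made for illustration; this is not how the LENSᵃⁱ platform implements RAG.

```python
def build_prompt(question, ranked_passages, max_chars=200):
    """Pack the best-ranked passages into a prompt without exceeding max_chars of context."""
    context, used = [], 0
    for source_id, text in ranked_passages:
        entry = f"[{source_id}] {text}"
        if used + len(entry) > max_chars:
            break  # budget reached: lower-ranked passages are left out
        context.append(entry)
        used += len(entry)
    return (
        "Answer using only the sources below and cite their ids.\n"
        + "\n".join(context)
        + f"\nQuestion: {question}"
    )

# Hypothetical passages, assumed to be already ranked by a vector search.
ranked_passages = [
    ("abstract-1", "Anti-drug antibodies can lower adalimumab serum levels."),
    ("abstract-2", "Immunogenicity testing detects anti-drug antibodies in patient serum."),
    ("abstract-3", "Adenosine deaminase deficiency causes severe combined immunodeficiency."),
]

prompt = build_prompt("How do anti-drug antibodies affect adalimumab?", ranked_passages)
print(prompt)
```

With this budget only the two best-ranked passages reach the prompt, which is why the quality of the ranking matters so much. A production system would more likely count tokens than characters.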

Representing Meaning in Vector Space

Search terms often have various meanings in different contexts. For instance, the abbreviation ‘ADA’ could refer to anti-drug antibodies, the American Dental Association, the American Diabetes Association, the Americans with Disabilities Act, adenosine deaminase, adalimumab, and other entities. By encoding text data with embeddings, one can narrow down the focus to the meaning that aligns with the search query.

The figure below shows a two-dimensional UMAP visualization of the embeddings for PubMed abstracts containing ‘ADA’. While the visual representation emphasizes similarity and does not provide a scalable measure of actual distance in the multidimensional vector space, it does show that the semantic ambiguity of ‘ADA’ is reflected in the embeddings. Encoding the input thus makes it possible to cluster the data and focus on the most relevant clusters.

[Figure: UMAP visualization of the embeddings for PubMed abstracts containing ‘ADA’]

The embeddings used here are dense vectors. Dense vectors are compact numerical representations of data points, typically generated by large language models, in this case PubMedBERT. Dense vectors capture the semantic meaning of text, allowing the system to retrieve relevant information even if the exact keywords are not present. This nuanced and context-aware retrieval offers advantages over traditional keyword-based methods.
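As a rough illustration of retrieval with dense vectors, the sketch below ranks a few texts by cosine similarity to a query vector. The three-dimensional vectors are hypothetical stand-ins invented for this example; real model embeddings have hundreds of dimensions and would come from an encoder such as PubMedBERT. The helper names `cosine_similarity` and `top_k` are likewise assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy embeddings, not produced by any real model.
corpus = {
    "anti-drug antibodies reduce adalimumab efficacy": [0.9, 0.1, 0.0],
    "immunogenicity of therapeutic proteins": [0.8, 0.3, 0.1],
    "American Dental Association guidelines": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of a query about immune responses to biologics.
query = [0.85, 0.2, 0.05]

results = top_k(query, corpus, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Note that the second text shares no keyword with the first, yet both rank above the dental text because their vectors point in a similar direction. That is the behavior the paragraph above describes.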

On the other hand, sparse vectors are typically high-dimensional but contain many zero values. As an example, a bag-of-words vector for a short English sentence would contain a one for every word that is present in the sentence and a zero for every English word that is not present in the sentence. The result is a very sparse vector with many zero values and a couple of ones. Sparse vectors are often generated using traditional methods like TF-IDF or BM25, which focus on the presence or frequency of specific terms in the text. These vectors require fewer resources and offer faster retrieval speeds.
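The bag-of-words example can be sketched in a few lines. The eight-word vocabulary below is a hypothetical toy; a realistic vocabulary would hold every word of the language, which is what makes the resulting vectors so sparse.

```python
def bag_of_words(sentence, vocabulary):
    """Binary bag-of-words: 1 if the vocabulary word occurs in the sentence, else 0."""
    words = set(sentence.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

# Hypothetical toy vocabulary, kept tiny so the vector fits on one line.
vocabulary = ["ada", "adenosine", "antibodies", "deaminase",
              "dental", "drug", "inhibits", "therapy"]

vector = bag_of_words("Drug inhibits adenosine deaminase", vocabulary)

# Keep only the non-zero positions, which is how sparse vectors save space.
sparse = {index: value for index, value in enumerate(vector) if value}

print(vector)
print(sparse)
```

Only the non-zero positions need to be stored and compared, which is one reason sparse retrieval can be cheap. TF-IDF and BM25 replace the ones with term weights but keep the same sparse layout.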

Searching in Vector Space

When generating embeddings, there are multiple levels to consider. Chunking is the process of breaking down large pieces of text into smaller segments that will be considered as units of relevant context. From tokens to documents, each level offers a different way to understand and analyze text data.
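One simple way to chunk text is to group consecutive sentences into overlapping windows, so that context crossing a chunk boundary is not lost. The sketch below assumes a naive punctuation-based sentence splitter and hypothetical `size` and `overlap` parameters; it is one possible strategy, not the chunking used by the platform.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_sentences(sentences, size=2, overlap=1):
    """Group sentences into chunks of `size`, sharing `overlap` sentences between neighbours."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last sentence is covered; avoid a redundant tail chunk
    return chunks

text = ("ADA deficiency causes SCID. Enzyme replacement is one option. "
        "Gene therapy is another.")

sentences = split_sentences(text)
chunks = chunk_sentences(sentences, size=2, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk would then be embedded separately. Larger chunks carry more context per vector, while smaller chunks point more precisely at the relevant passage.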

Starting at the most granular level, tokens represent individual words or parts of individual words within a text. Large language models often calculate embeddings based on single-word tokens. This may dilute the semantic richness of the text. The LENSᵃⁱ Integrated Intelligence Platform uses concepts. These are words or word groups that form a unit of meaning. Concepts are more specific than tokens for keyword search. Moreover, dense embeddings of concepts within a sentence are particularly well-suited for detecting synonyms.

[Figure: Token embeddings compared with ‘concept’ embeddings]

The following UMAP visualization of concept embeddings shows similar embeddings for the semantically related instances of ‘ada treatment’ and ‘ada therapy’ and also for instances of ‘ada inhibition’ and ‘ada treatment’, whereas the embeddings for ‘ada professional practice committee’, ‘ada activity’, ‘ada scid’ and ‘ada formation’ build separate non-overlapping clusters.

[Figure: UMAP visualization of ADA concept embeddings]

CRC constructs (concept-relation-concept patterns) effectively capture the intricate boundaries of semantic meaning. Focusing on CRCs enhances the semantic similarity search while filtering out non-relevant sentence parts, yielding a more condensed representation of meaning. Moving up to the levels of sentence and document embeddings can be useful for obtaining a more general idea of the context rather than focusing on a particular search term in a query.

The level of embeddings that is most relevant will depend on the specific use case at hand.

In conclusion, vector search presents numerous opportunities to optimize search results by guiding users to their most relevant data. Dense and sparse vectors, as well as embeddings at various levels, can be combined to create a hybrid system tailored to specific use cases. In the field of AI-driven rational drug design, vector search is an additional computational technique that fits into a multidisciplinary approach and supports more than just text data, as will become clear in our future blog post about vector search for protein analysis.
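One common way to build such a hybrid is to rescale dense and sparse scores to a shared range and blend them with a weight. The sketch below assumes min-max normalization and a hypothetical weight `alpha`; the document ids and scores are invented for illustration, and other fusion schemes are equally valid.

```python
def min_max_normalize(scores):
    """Rescale a {doc: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(dense_scores, sparse_scores, alpha=0.5):
    """Blend normalized dense and sparse scores; alpha is the weight of the dense score."""
    dense = min_max_normalize(dense_scores)
    sparse = min_max_normalize(sparse_scores)
    docs = set(dense) | set(sparse)
    combined = {
        doc: alpha * dense.get(doc, 0.0) + (1 - alpha) * sparse.get(doc, 0.0)
        for doc in docs
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores: cosine similarities (dense) and keyword-match scores (sparse).
dense_scores = {"doc_a": 0.91, "doc_b": 0.88, "doc_c": 0.35}
sparse_scores = {"doc_a": 2.1, "doc_b": 7.4, "doc_c": 6.0}

ranking = hybrid_rank(dense_scores, sparse_scores, alpha=0.5)
for doc, score in ranking:
    print(f"{doc}: {score:.3f}")
```

Here the document that scores well on both signals ends up first, ahead of the one that wins on dense similarity alone. Tuning `alpha` shifts the balance between semantic and keyword matching for a given use case.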

Combining the LENSᵃⁱ Integrated Intelligence Platform with the InterSystems IRIS data platform creates a robust vector search mechanism, enhancing rational drug discovery and personalized medicine. Additionally, LENSᵃⁱ is designed to support hallucination-free, traceable, and up-to-date Retrieval-Augmented Generation, helping researchers access accurate and reliable data.

Originally published at https://blog.biostrand.ai on June 13, 2024.