Advancing rational drug design: Vector search in protein analytics

BioStrand (a subsidiary of IPA)
Jul 11, 2024

Drug discovery processes are typically organized step by step, from target identification to lead optimization. As a result, data is siloed at every stage, leading to a compounding loss of quantitative and qualitative insights across the different stages. To realize the full potential of drug discovery, data integration within a data-driven automation platform is essential.

The LENSᵃⁱ™ Foundation AI model powered by HYFT technology is designed to solve the challenges behind AI-driven Rational Drug Design, harnessing advanced AI and ML capabilities to navigate the complexities of drug discovery with high precision. By integrating predictive modelling, data analysis and lead optimization functionalities, LENSᵃⁱ accelerates the end-to-end discovery and development of promising drug candidates.

The LENSᵃⁱ system uniquely integrates both structured and unstructured data, serving as a centralized graph for storing, querying, and analyzing diverse datasets, including different omics layers, chemical, and pharmacological information. With LENSᵃⁱ, data from every phase of the drug discovery process is no longer siloed but represented as subgraphs within an interconnected graph that summarizes data across all processes. This interconnected approach enables bidirectional and cyclical information flow, allowing for flexibility and iterative refinement.

For example, during in-silico lead optimization, challenges may arise regarding pharmacokinetic properties or off-target effects of lead compounds. By leveraging the integrated knowledge graph, we can navigate back to earlier phases to reassess decisions and explore alternative strategies. This holistic view ensures that insights and adjustments can be continuously incorporated throughout the drug discovery process.
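The idea of navigating back from a late-stage finding to earlier decisions can be sketched with a toy graph. This is a minimal illustration only: the node names, the edge direction, and the `upstream` helper are hypothetical and do not reflect the actual LENSᵃⁱ data model.

```python
from collections import deque

# Toy directed graph: each edge points from an earlier-phase item to the
# later-phase item derived from it. All node names are hypothetical.
edges = {
    "target:T1": ["hit:H1", "hit:H2"],
    "hit:H1": ["lead:L1"],
    "hit:H2": ["lead:L2"],
    "lead:L1": ["finding:off_target_effect"],
}

def upstream(graph, node):
    """Return every node from which `node` can be reached, nearest first."""
    # Invert the edges so we can walk from late phases back to early ones.
    parents = {}
    for src, dsts in graph.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)
    seen, order, queue = {node}, [], deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

trace = upstream(edges, "finding:off_target_effect")
print(trace)
```

In this sketch, the trace leads from the off-target finding back through the lead and hit to the original target, which is the kind of path one would follow to reassess an earlier decision.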

Navigation through integrated knowledge graphs of complex biological data is made possible by the patented HYFT technology. HYFTs, which are amino acid patterns mined across the biosphere, serve as critical connectors within the knowledge graph by capturing diverse layers of information at both the subsequence and sequence levels. The HYFTs encapsulate information about ‘syntax’ (the arrangement of amino acids), as well as ‘structure’ and ‘function,’ and connect this data to textual information at the sentence and concept levels. This HYFT-based multi-modal integration ensures that we move beyond mere ‘syntax’ to incorporate ‘biological semantics,’ representing the connection between structure and function.

Within this single framework, detailed structural information is aligned with relevant textual metadata, providing a comprehensive understanding of biological sequences.

Exploring textual metadata can be very useful in the target identification stage, for example to gather detailed information on the target epitopes: “In which species are these epitopes represented?” “Can we extract additional information and insights on the epitopes from the literature?” This information can be obtained by querying the knowledge graph and harnessing the benefits of the fine-grained HYFT-based approach, which captures information at the ‘subsequence’ level.

Indeed, at the HYFT level, relevant textual concepts (sub-sentence level) are captured, which allows us to identify whether a specific HYFT, represented in the target, might reveal relevant epitopes.

Apart from textual metadata, there is ‘flat’ metadata such as immunogenicity information, germline information, pharmacological data, developability data, and the presence of sequence liabilities.

At each of the previously mentioned information layers, additional ‘vector’ data is obtained from various protein large language models (pLLMs). This means that an embedding is associated with each (sub)sequence or concept. These embeddings enable ‘vector’ searches that identify similar sequences, enhancing tasks like protein structure prediction and functional annotation. For a deep dive into vector search, see our blog on vector search in text analysis. This capability allows for the extraction of a wider range of features and the uncovering of hidden patterns across all these dimensions.
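To make the notion of a vector search concrete, here is a minimal sketch that ranks stored embeddings by cosine similarity to a query embedding. The three-dimensional vectors, the sequence ids, and the `vector_search` helper are made up for illustration; cosine similarity and brute-force ranking are assumptions, since the article does not specify the similarity measure or index used.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def vector_search(query, index, k=2):
    """Return the k (id, score) pairs whose embeddings are most similar to `query`."""
    scored = [(key, cosine(query, vec)) for key, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical embeddings; real ones would come from a pLLM and have
# hundreds or thousands of dimensions.
index = {
    "seq_A": [0.9, 0.1, 0.0],
    "seq_B": [0.8, 0.2, 0.1],
    "seq_C": [0.0, 0.1, 0.9],
}
hits = vector_search([1.0, 0.0, 0.0], index, k=2)
print(hits)
```

A brute-force scan like this is fine for small examples; at repertoire scale one would typically swap in an approximate nearest-neighbor index.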

LENSᵃⁱ: The importance of embeddings at the sub-sequence level

BioStrand LENSᵃⁱ’s comprehensive approach in protein analytics is similar to text-based analytics. In text analysis, we refine semantic boundaries by intelligently grouping words to capture textual meaning. Similarly, in protein analytics, we strategically group residue tokens (amino acids) to form sequential HYFTs. Just as words are clustered into synonyms in text analytics, “protein words” are identified and clustered based on their biological function in protein analytics. These “protein words,” when present in different sequences, reveal a conserved function. By leveraging this method, we gain a deeper understanding of the functional conservation across various protein sequences.

Thus, the LENSᵃⁱ platform, based on HYFT technology, analyses proteins at the sub-sequence level, focusing on the HYFT patterns, as well as at the full-sequence level. As in natural language, where some words contribute little to meaning, some residues are less relevant and contribute little to function. Therefore, by focusing on HYFTs, we obtain a more condensed representation of the information and reduce noise by excluding the information captured in non-critical regions.

In text analysis, we can almost immediately recognize semantic similarity. We recognize sentences that are similar in meaning, although composed of different words, because of our natural understanding of synonyms. In protein language, to identify ‘functional similarity’, in other words to determine whether two different amino acid patterns (HYFTs) might yield the same function, we rely on a mathematical method: pLLMs.

pLLMs are transformer-based models that generate an embedding starting from single amino acid residues. Depending on the data the pLLM is trained on (typically millions of protein sequences), it tries to discover hidden properties by examining residue-residue connections (between neighboring residues at both short and longer distances).

Figure 1: BioStrand’s method of chunking tokens

The dataset and task a pLLM was trained on determine the properties it represents, which can vary from one pLLM to another. By stacking the embeddings from different large language models (LLMs), a more complete view of the protein data is generated.
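Stacking can be read as concatenating the vectors that different models produce for the same sequence. The sketch below assumes exactly that; the model names are placeholders, and the per-model L2 normalization is an added assumption (so that no single model dominates by scale), not something the article specifies.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def stack_embeddings(per_model):
    """Concatenate one embedding per model into a single vector."""
    stacked = []
    for name in sorted(per_model):  # fixed order keeps dimensions aligned across sequences
        stacked.extend(l2_normalize(per_model[name]))
    return stacked

# Hypothetical embeddings of one sequence from two hypothetical models.
per_model = {
    "antibody_llm": [3.0, 4.0],
    "generic_pllm": [1.0, 0.0, 0.0],
}
stacked = stack_embeddings(per_model)
print(len(stacked), stacked)
```

The stacked vector has as many dimensions as the models combined, so each model's learned properties occupy their own block of the representation.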

Furthermore, we can use clustering and vector search algorithms to group sequences that are similar in a broad range of dimensions.
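As one possible illustration of grouping by embedding similarity, the sketch below links any two sequences whose cosine similarity exceeds a threshold and reports the connected groups (a simple single-linkage scheme). The clustering method, the threshold, and the two-dimensional vectors are all assumptions chosen for brevity; the article does not name the algorithm used.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_by_threshold(vectors, threshold=0.9):
    """Group ids that are linked by pairwise cosine similarity >= threshold."""
    ids = list(vectors)
    parent = {i: i for i in ids}  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical embeddings for four sequences.
vectors = {
    "s1": [1.0, 0.0],
    "s2": [0.95, 0.1],
    "s3": [0.0, 1.0],
    "s4": [0.1, 0.9],
}
clusters = cluster_by_threshold(vectors)
print(clusters)
```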

Protein embeddings are typically generated at the single amino acid level. In contrast, the HYFT-based model obtains embeddings from LLMs at the pattern level by concatenating residue-level embeddings. These ‘protein word’ or HYFT-level embeddings can be obtained from several pre-trained LLMs, ranging from antibody-specific LLMs to more generic pLLMs.
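A minimal sketch of pattern-level embedding by concatenation follows. It assumes residue-level vectors are already available as one list per residue; the placeholder vectors, the example sequence, and the `pattern_embedding` helper are hypothetical stand-ins for real pLLM output.

```python
def pattern_embedding(residue_embeddings, start, length):
    """Concatenate the residue vectors covering positions start .. start+length-1."""
    if start < 0 or start + length > len(residue_embeddings):
        raise ValueError("pattern span falls outside the sequence")
    flat = []
    for vec in residue_embeddings[start:start + length]:
        flat.extend(vec)
    return flat

sequence = "QVQLVKKPGASVK"  # illustrative sequence containing the pattern
pattern = "VKKPGAS"

# Placeholder residue vectors of dimension 2; a real pLLM would supply these.
residue_embeddings = [[float(i), float(ord(aa))] for i, aa in enumerate(sequence)]

start = sequence.find(pattern)
hyft_vec = pattern_embedding(residue_embeddings, start, len(pattern))
print(start, len(hyft_vec))
```

Under this sketch, the pattern vector has length (pattern length × residue dimension), so only patterns of equal length are directly comparable without further pooling or padding.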

This HYFT-based embedding model offers several benefits.

  • First, this approach captures richer and more informative embeddings compared to the single residue level embeddings.
  • Second, the concatenation of residue-level embeddings allows for preserving sequence-specific patterns, enhancing the ability to identify functional and structural motifs within proteins.
  • Lastly, integrating different LLMs ensures that these embeddings leverage vast amounts of learned biological knowledge, improving the accuracy and robustness of downstream tasks such as protein function prediction and annotation.

So, if we want to identify which HYFTs are ‘synonyms’, we deploy the HYFT-level embeddings.

Returning to the language analogy, where ‘apple’ will take a similar place in the embedding space as ‘orange’ or ‘banana’ — because they are all fruits — in protein analytics we are interested in the HYFTs that take similar places in the embedding space — because they all perform the same function in a certain context.

Figure 2: Embeddings at the sub-sequence level: concepts versus HYFTs

As Fig. 2 illustrates, just as the word “apple” can have different meanings depending on the context (referring to a phone or a fruit), the HYFT sequence ‘VKKPGAS’ can also appear in various contexts, representing different protein annotations and classifications. For instance, a specific HYFT can be found in proteins ranging from bacterial and fungal ones to human immunoglobulins. Consequently, the embeddings for HYFT VKKPGAS might occupy different positions in the embedding space, reflecting these distinct functional contexts.
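The effect of context can be mimicked with a deliberately crude stand-in for a contextual model: describe each occurrence of the pattern by the amino acid composition of the pattern plus a few flanking residues. This is not how a pLLM computes embeddings; the `toy_context_embedding` helper, the flank size, and both sequences are illustrative assumptions.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def toy_context_embedding(sequence, pattern, flank=3):
    """Amino acid composition of the pattern plus `flank` residues on each side."""
    start = sequence.find(pattern)
    if start < 0:
        raise ValueError("pattern not found in sequence")
    window = sequence[max(0, start - flank):start + len(pattern) + flank]
    counts = Counter(window)
    return [counts[aa] / len(window) for aa in ALPHABET]

hyft = "VKKPGAS"
# Two illustrative sequences that both contain the same pattern.
in_context_1 = toy_context_embedding("QVQLVQSGAEVKKPGASVKVSCK", hyft)
in_context_2 = toy_context_embedding("MSTLLDDVKKPGASRRWEE", hyft)

same_vector = in_context_1 == in_context_2
print(same_vector)
```

Because the flanking residues differ, the same pattern ends up with two different vectors, which is the behavior the paragraph above describes for contextual embeddings.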

Use Case: Transforming Antibody Discovery with Integrated Vector Search in Hit Expansion Analysis

In the LENSᵃⁱ hit expansion analysis pipeline, outputs from phage-display, B-cell, or hybridoma technologies are combined with a large-scale enriched antibody sequence dataset sequenced by NGS. The primary goal is to expand the number and diversity of potential binders: functional antibodies from the NGS dataset that are closely related to a set of known binders.

The data from the NGS repertoire set and the known binders are represented in a multi-modal knowledge graph, incorporating various modalities such as sequence, structure, function, text, and embeddings. This comprehensive representation allows the NGS repertoire set to be queried to identify a diverse set of additional hits by simultaneously exploiting different information levels, such as structural, physicochemical, and pharmacological properties like immunogenicity.

A vital component of this multi-modal knowledge graph is the use of vector embeddings, where antibody sequences are represented in multi-dimensional space, enabling sophisticated analysis. These vector embeddings can be derived from different LLMs. For instance, in the example below, clinical antibodies obtain sequence-level embeddings from an antibody-specific LLM, represented in 2D space and colored by their immunogenicity score. This immunogenicity score can be used to filter some of the antibodies, demonstrating how metadata can be utilized to select embedding-based clusters.
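Combining embedding similarity with a metadata filter can be sketched as follows. Everything here is hypothetical: the record layout, the ids, the similarity threshold, and the immunogenicity cutoff (scores are assumed to lie between 0 and 1, with 1 meaning highly immunogenic) are assumptions chosen for illustration, not the pipeline's actual parameters.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_hits(binders, repertoire, min_similarity=0.95, max_immunogenicity=0.5):
    """Return repertoire ids close to any known binder and under the immunogenicity cutoff."""
    hits = []
    for seq_id, record in repertoire.items():
        if record["immunogenicity"] > max_immunogenicity:
            continue  # metadata filter: drop likely immunogenic candidates
        best = max(cosine(record["embedding"], b) for b in binders.values())
        if best >= min_similarity:
            hits.append((seq_id, round(best, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical known binder and NGS repertoire records.
binders = {"binder_1": [1.0, 0.0]}
repertoire = {
    "ngs_1": {"embedding": [0.98, 0.05], "immunogenicity": 0.2},
    "ngs_2": {"embedding": [0.97, 0.10], "immunogenicity": 0.9},
    "ngs_3": {"embedding": [0.10, 0.95], "immunogenicity": 0.1},
}
expanded = expand_hits(binders, repertoire)
print(expanded)
```

In this toy data, `ngs_2` is close to the binder but is filtered out by its immunogenicity score, while `ngs_3` passes the filter but is too dissimilar, so only `ngs_1` is reported.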

Furthermore, using vector embeddings allows for continuous data enrichment and the integration of the latest information into the knowledge graph at every step of the antibody discovery and development cycle, enhancing the overall process.

In protein engineering, this continuous data enrichment proves advantageous in various aspects, such as introducing specific mutations aimed at enhancing binding affinity, humanizing proteins, and reducing immunogenicity. This new data is dynamically added to the knowledge graph, ensuring a fully integrated view of all the data throughout the antibody design cycle. These modifications are pivotal in tailoring proteins for therapeutics, ensuring they interact more effectively with their targets while minimizing unwanted immune responses.

Figure 3: The clinical antibodies obtain sequence-level embeddings from an antibody-specific LLM, represented in 2D space and colored by their immunogenicity score (1 indicating high immunogenicity)

Conclusion

The LENSᵃⁱ platform provides a robust multi-modal approach to optimize antibody discovery and development processes.

By integrating sequence, structure, function, textual insights, and vector embeddings, LENSᵃⁱ bridges gaps between disparate data sources.

The platform enhances feature extraction by leveraging embedding data from various LLMs, capturing a wide array of biologically relevant ‘hidden properties’ at the sub-sequence level. This capability ensures a comprehensive exploration of nuanced biological insights, facilitating an integrated data view.

By utilizing “vector search”, the platform can efficiently query and analyze these embeddings, enabling the identification of similar sequences and functional motifs across large and complex datasets. This approach not only captures the ‘syntax’ and ‘structure’ of amino acid patterns but also integrates ‘biological semantics,’ thereby providing a holistic understanding of protein functions and interactions.

Consequently, LENSᵃⁱ improves the efficiency of antibody discovery and development from identifying novel targets to optimizations in therapeutic development, such as hit expansion analysis, affinity maturation, humanization, and immunogenicity screening processes.

Furthermore, LENSᵃⁱ enables cyclical enrichment of antibody discovery and development processes by adding and integrating information into a knowledge graph at every step of the development cycle.

This continuous enrichment sets a new benchmark for integrated, data-driven approaches in biotechnology, ensuring ongoing improvements and innovations.

Originally published at https://blog.biostrand.ai on July 11, 2024.
