Multimodal language models in protein engineering: Functional clonotyping & beyond
In early 2023, ChatGPT reached the milestone of 100 million users. Generative AI defined the year, with prominent large language models such as GPT-4 captivating the world through their remarkable mastery of natural language. Interestingly, OpenAI’s latest upgrade to ChatGPT introduced powerful multimodal capabilities, enabling the model to handle inputs beyond text, including images, audio and video. This showcases the future potential of generative AI for hyper-personalization and diverse applications. What if these models progress to the point of mastering the language of life? Imagine protein-level LLMs learning the “semantics” and “grammar” of proteins, not just as static structures but as dynamic multimodal entities, enabling us to unravel the intricacies of their functions and behaviors at a level of detail previously unimaginable.
The need for multi-modality in protein engineering workflows
Multi-modal models that integrate multiple sources of data should likewise be introduced into protein engineering workflows. Going beyond sequence data alone could help address a wide array of known problems such as protein classification, mutational effect prediction and structure prediction. In the context of antibody discovery, an interesting problem is functional clonotyping, i.e. grouping antibodies into clonal groups that target the same antigen and epitope. Typically, the heavy-chain CDR3 (HCDR3) is used as a unique identifier, and clustering is therefore frequently performed by requiring a high percentage of HCDR3 sequence similarity together with identical V-J gene assignments. However, it has been shown that many different HCDR3s can be identified within a target-specific antibody population [1]. Moreover, the same HCDR3 can be generated by many different rearrangements, and specific target binding is the outcome of unique rearrangements and VL pairing: “the HCDR3 is necessary, albeit insufficient for specific antibody binding.” [1] In addition, it has been demonstrated that antibodies within the same cluster, targeting the same epitope, can encompass highly divergent HCDR sequences [2]. This underscores the necessity of incorporating additional “layers” of information in pursuit of the clustering objective. For instance, SPACE2 excels at clustering antibodies that bind shared epitopes, and these clusters, characterized by functional coherence and structural similarity, embrace diversity in terms of sequence, genetic lineage and species origin [3].
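To make the conventional baseline concrete, the sketch below groups antibodies by identical V-J gene assignments and then clusters them when their HCDR3s exceed an identity threshold. The toy data, the 80% cut-off and the equal-length comparison are illustrative assumptions, not a prescribed pipeline.

```python
# A minimal, illustrative sketch of sequence-based clonotyping (assumptions:
# toy antibody records, equal-length HCDR3 comparison, 80% identity cut-off).
from collections import defaultdict

antibodies = [
    {"id": "Ab1", "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "hcdr3": "ARDRGYSSGWYFDY"},
    {"id": "Ab2", "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "hcdr3": "ARDRGYSSGWYFDV"},
    {"id": "Ab3", "v_gene": "IGHV1-69", "j_gene": "IGHJ6", "hcdr3": "ARGGTTVTTDYYYYGMDV"},
]

def hcdr3_identity(a, b):
    """Fraction of identical positions; only defined here for equal-length HCDR3s."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Step 1: group antibodies by identical V-J assignment.
vj_groups = defaultdict(list)
for ab in antibodies:
    vj_groups[(ab["v_gene"], ab["j_gene"])].append(ab)

# Step 2: within each V-J group, greedily cluster HCDR3s above 80% identity.
clonotypes = []
for members in vj_groups.values():
    for ab in members:
        for cluster in clonotypes:
            if cluster["vj"] == (ab["v_gene"], ab["j_gene"]) and \
               hcdr3_identity(cluster["members"][0]["hcdr3"], ab["hcdr3"]) >= 0.8:
                cluster["members"].append(ab)
                break
        else:
            clonotypes.append({"vj": (ab["v_gene"], ab["j_gene"]), "members": [ab]})

for c in clonotypes:
    print(c["vj"], [m["id"] for m in c["members"]])
```

As the cited work argues, such purely sequence-based grouping can both split functionally equivalent antibodies and merge unrelated ones, which is exactly the gap that additional information layers aim to close.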
Nevertheless, the most significant advances may come from the transformative capacities of LLMs, not only because of their substantial scaling advantages but also because of the extensive array of possibilities they open up.
While large language models (LLMs) for natural language excel at grasping context, protein language models (PLMs) are advancing our understanding of the meanings, contexts and intricate relationships between the fundamental building blocks of proteins, the amino acids. Much like the word “apple” assumes different meanings depending on context, particular amino acids or amino acid patterns can carry different nuances within protein sequences. The process begins with the tokenization of protein sequence data, transforming proteins into linear strings of amino acids. Some amino acids may “impact” other, more distant amino acids in such a way that a different function is revealed; this is the semantic layer. Again, compare this to two phrases: “apple, pear and banana” versus “I bought an Apple phone”; the semantics change with context. To unravel the inner workings of the models behind LLMs, the so-called transformer models, the attention layers yield valuable information. Which contextual information is important to classify “apple” as a fruit or a tech company? Now ask a similar question for proteins: which context residues or residue patterns influence another residue or pattern to take part in a different function? Does the model learn residue-residue interactions, reflected in attention weights, that overlap with structural contacts? By overlaying protein-domain knowledge on the model’s learnt embedding representations, we can uncover underlying protein intricacies. Moreover, we believe that using these lower-layer embeddings as predictive features, instead of or on top of the final-layer embeddings, may help make the model more understandable and transparent.
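As a concrete illustration of how layer-wise embeddings and attention maps can be inspected, the sketch below uses a small public ESM-2 checkpoint via the HuggingFace transformers library as a stand-in PLM; the checkpoint name, the toy sequence and the chosen layer index are assumptions for demonstration only.

```python
# A minimal sketch: extracting per-layer hidden states and attention maps
# from a public ESM-2 checkpoint (assumed stand-in for any PLM of interest).
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t6_8M_UR50D"  # small public ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

sequence = "QVQLVQSGAEVKKPGASVKVSCKAS"  # toy heavy-chain fragment

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

# hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, hidden_dim]
# attentions:    tuple of num_layers tensors, each [1, num_heads, seq_len, seq_len]
hidden_states = outputs.hidden_states
attentions = outputs.attentions

# Per-residue embedding from an intermediate ("lower") layer, special tokens removed.
layer_idx = 3
residue_embeddings = hidden_states[layer_idx][0, 1:-1]

# Last-layer attention averaged over heads: a residue-residue map that can be
# compared against known structural contacts or annotated functional regions.
attn_map = attentions[-1][0].mean(dim=0)[1:-1, 1:-1]
print(residue_embeddings.shape, attn_map.shape)
```

Overlaying such attention maps with structural contact maps, or probing intermediate-layer embeddings with domain annotations, is one practical route to the kind of interpretability discussed above.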
This fits naturally with the idea of strategically combining multi-modal data. Part of the potential for improving predictive performance, e.g. for functional clonotyping of antibodies, lies in the strategic concatenation of embeddings from different layers across various protein language models. Indeed, PLMs are trained for different purposes: AbLang [4], for example, is trained to predict missing amino acids in antibody sequences, while AntiBERTy [5] is trained to predict paratope-binding residues. Each model’s embeddings may therefore encompass distinct, perhaps non-overlapping and unique angles of protein-relevant information, be it structural, functional, physicochemical, immunogenicity-related information or a combination thereof. Delving deeper into functional clonotyping, where epitope binning gains importance, relying solely on antigen-agnostic models may prove insufficient. Our curiosity lies in understanding how residues on the paratope interact with those on the epitope, a two-sided perspective that has been addressed through cross-modal attention. This method, akin to a Graph Attention Network applied to a bipartite antibody-antigen graph, emerges as a compelling approach for modelling multimodality in antibody-antigen interactions and, more broadly, in protein-protein interactions [6]. In general, we should build comprehensive representations that go beyond individual layers to open up new avenues for understanding the language of proteins.
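As a hedged sketch of what such concatenation could look like in practice, the snippet below mean-pools per-residue embeddings from two layers of two public ESM-2 checkpoints (standing in for purpose-specific models such as AbLang or AntiBERTy, whose own APIs differ), concatenates them per sequence and clusters the result. The checkpoints, layer choices, toy sequences and cluster count are all illustrative assumptions.

```python
# Illustrative layer-wise, multi-model embedding concatenation followed by
# clustering; the two ESM-2 checkpoints stand in for purpose-specific PLMs.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import AgglomerativeClustering

def mean_pooled_embedding(model, tokenizer, sequence, layer_idx):
    """Mean-pool the per-residue hidden states of one layer into a single vector."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Drop special tokens (first and last positions) before pooling.
    return out.hidden_states[layer_idx][0, 1:-1].mean(dim=0)

checkpoints = ["facebook/esm2_t6_8M_UR50D", "facebook/esm2_t12_35M_UR50D"]
models = [(AutoTokenizer.from_pretrained(c), AutoModel.from_pretrained(c).eval())
          for c in checkpoints]

sequences = [
    "QVQLVQSGAEVKKPGASVKVSCKAS",   # toy antibody heavy-chain fragments
    "EVQLVESGGGLVQPGGSLRLSCAAS",
    "QVQLQESGPGLVKPSETLSLTCTVS",
]

features = []
for seq in sequences:
    parts = []
    for tok, mod in models:
        for layer_idx in (3, -1):      # an intermediate and the final layer
            parts.append(mean_pooled_embedding(mod, tok, seq, layer_idx))
    features.append(torch.cat(parts))  # one concatenated vector per sequence
X = torch.stack(features).numpy()

# Cluster the concatenated representations as a crude stand-in for functional clonotyping.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)
```

In a real workflow the concatenated features would feed a supervised or semi-supervised clonotyping step validated against binding data, rather than an unsupervised toy clustering.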
Protein words to capture semantics
Language models for natural language learn how words are used in context, i.e. words appearing in similar contexts acquire similar meanings. This allows the model to capture meaning from distributional patterns alone. In natural language, symbols like spaces and punctuation help identify meaningful words, making explicit linguistic knowledge less necessary. Applying this idea to proteins is harder, however, because there is no clear definition of meaningful protein units, or “protein words.” We need a more analytical, expertise-driven approach to identify meaningful parts in protein sequences. This is where BioStrand’s HYFT technology comes into play. Amino acid patterns offer a more refined approach to embeddings than full-sequence embeddings, analogous to the way semantic embeddings capture “logical” word groups or phrases to improve understanding of textual language. While full-sequence embeddings encapsulate the entire protein sequence in a holistic manner, amino acid patterns focus on specific meaningful blocks within the sequence. BioStrand’s proprietary HYFTs, which serve as protein building blocks with well-defined boundaries, enhance robustness to sequence variability by emphasizing critical regions and downplaying non-critical or less relevant areas of the full protein sequence.
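The snippet below hedges a simple version of this idea: instead of mean-pooling an embedding over the full sequence, it pools only over a pre-defined residue span standing in for a “protein word.” The span boundaries, toy sequence and checkpoint are hypothetical; BioStrand’s actual HYFT identification is proprietary and is not reproduced here.

```python
# Illustrative comparison of full-sequence vs. region-level ("protein word")
# embeddings. Assumptions: a public ESM-2 checkpoint, a toy sequence and a
# hand-picked residue span; real HYFT boundaries are not computed here.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

sequence = "QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYAMH"
word_span = (25, 35)  # hypothetical "protein word" (0-based residue positions)

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]

residues = hidden[0, 1:-1]                                        # drop special tokens
full_embedding = residues.mean(dim=0)                             # holistic, whole-sequence view
word_embedding = residues[word_span[0]:word_span[1]].mean(dim=0)  # focused block

print(full_embedding.shape, word_embedding.shape)
```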
Moreover, the HYFTs serve as a central, unifying connector element, laying the foundation for a holistic data management system. This integration extends beyond protein sequence, structural and functional data to encompass flat metadata and vector embedding data, as well as textual enrichment data extracted from the literature. These connector elements can traverse omics databases or external datasets such as IEDB, serving as starting points for NLP searches. In this way, a bridge is established between genetic information and the relevant literature.
Taking all this together, an integrated data management system becomes necessary to build generalized foundation models for biology, rather than siloing each step independently. The antibody discovery process then undergoes a transformative shift, becoming a more informed journey in which the flow of information is rooted in genetic building blocks. At each step, a comprehensive understanding is cultivated by synthesizing insights across genetic, textual and structural dimensions, including diverse embeddings from different layers of LLMs that capture varying sources of information. This is where LENSai comes into play. By leveraging a vast knowledge graph interconnecting syntax (multi-modal sequential and structural data) and semantics (biological function), combined with insights captured at the residue, region or HYFT level and harnessed through the power of LLM embeddings, LENSai paves the way for improving drug-discovery tasks such as functional clustering, developability prediction and immunogenicity risk prediction. Its advanced capabilities empower researchers to explore innovative protein structures and functionalities, unlocking new opportunities in antibody design and engineering.
Julie Delanote | Data Scientist at BioStrand (a subsidiary of IPA)
[1] https://www.frontiersin.org/articles/10.3389/fimmu.2018.00395/full
[3] https://www.frontiersin.org/articles/10.3389/fmolb.2023.1237621/full
[4] https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac046/6609807
Originally published at https://blog.biostrand.ai on February 1, 2024.