Amino Meenie Miney Moe — How to Characterise Protein Domains

BioStrand (a subsidiary of IPA)
Feb 21, 2022

Classifying genes and assigning functions to gene products is a fundamental step in converting huge volumes of sequence data into useful biological information. Protein and DNA sequencing are two common approaches to biological identification.

However, there are several reasons for researchers to opt for protein sequencing over DNA sequencing. First, there is the availability of good, well-annotated databases of protein sequences and protein sequence signatures.

Second, protein sequences are more closely aligned with function and provide information on post-translational modifications central to the structure and function of a protein.

And finally, since proteins have a larger alphabet (20 amino acids compared to the 4 bases of DNA), protein sequence searches have a higher signal-to-noise ratio and provide greater sensitivity than DNA sequence searches.
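To make the alphabet-size argument concrete, here is a minimal sketch under a deliberately simplified model: every letter is assumed to be equally likely and independent of its neighbours, which real sequences are not. The function name `chance_match_probability` is hypothetical and used only for illustration. It compares how likely two random stretches of the same length are to match purely by chance, position for position, in a 4-letter and a 20-letter alphabet.

```python
def chance_match_probability(alphabet_size: int, length: int) -> float:
    """Probability that two random sequences of `length` letters are identical.

    Assumes uniform, independent letters (a simplification, not a
    model of real biological sequences).
    """
    if alphabet_size < 1 or length < 0:
        raise ValueError("alphabet_size must be >= 1 and length must be >= 0")
    return (1.0 / alphabet_size) ** length


for k in (5, 10):
    dna = chance_match_probability(4, k)       # 4 bases
    protein = chance_match_probability(20, k)  # 20 amino acids
    print(f"length {k}: DNA {dna:.3e}, protein {protein:.3e}")
```

Under these assumptions, a chance match of a given length is far less likely in the larger alphabet, which is the intuition behind the lower background noise of protein searches.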

Protein characterisation has several critical downstream research applications, such as analysing the impact of potential candidates in drug discovery and development, screening biological products for safety and quality, and disease control and prevention.

Techniques for protein characterisation

Proteins can be characterised at different levels: for instance, in terms of their function in a cell or even the context of that specific function. This means protein sequence analysis can focus on the entire length of the sequence, on single domains or motifs, or even on specific amino-acid residues.

However, characterising a protein involves significantly more challenges than characterising a small molecule. Why? Because proteins have more properties to investigate, their complex and varied structures are sensitive to conditions like temperature and pH, and they tend to undergo post-translational modifications that affect their properties.

As a result, protein characterisation often tended to be incomplete in the past, as it mainly relied on analytical approaches based on biological methods.

However, the increasing availability of completely sequenced genomes, next-generation technologies, and sophisticated bioinformatics tools for searching and integrating with proteomics databases means that it is now possible to comprehensively and accurately characterise proteins at any level.

Today, there are several techniques available for protein characterisation, including mass spectrometry, Edman degradation, static and dynamic light scattering, ultra-high-performance liquid chromatography, gel electrophoresis, and amino acid analysis.

However, the choice of technique (in the biopharmaceutical industry, for example) depends on a number of variables, including sensitivity, specificity, the ability to accurately quantify specific analytes, speed, and ease of use. Today, no single method can address all of those factors at once.

In this walkthrough, we will demonstrate how BioStrand R&R can help mitigate some of the current challenges and limitations of protein characterisation.

Research Objective

In this workflow, we demonstrate how you can start with any random sequence fragment to:

  1. Simultaneously retrieve all matching protein sequences from multiple protein databases
  2. Find sequences homologous with the query sequence, get detailed sequence information on each result, and identify associated domains

The BioStrand protein characterisation workflow

We start with a protein sequence:

MSSLGASFVQIKFDDLQFFENCGGGSFGSVYRAKWISQDKEVAVKKLLKIEKEAEILSVLSHRNIIQFYGVILEPPNYGIVTEYASLGSLYDYINSNRSEEMDMDHIMTWATDVAKGMHYLHMEAPVKVIHRDLKSRNVVIAADGVLKICDFGASRFHNHTTHMSLVGTFPWMAPEVIQSLPVSETCDTYSYGVVLWEMLTREVPFKGLEGLQVAWLVVEKNERLTIPSSCPRSFAELLHQCWEADAKKRPSFKQIISILESMSNDTSLPDKCNSFLHNKAEWRCEIEATLERLKKLERDLSFKEQELKERERRLKMWEQKLTEQSNTPLLPSFEIGAWTEDDVYCWVQQLVRKGDSSAEMSVYASLFKENNITGKRLLLLEEEDLKDMGIVSKGHIIHFKSAIEKLTHDYINLFHFPPLIKDSGGEPEENEEKIVNLELVFGFHLKPGTGPQDCKWKMYMEMDGDEIAITYIKDVTFNTNLPDAEILKMTKPPFVMEKWIVGIAKSQTVECTVTYESDVRTPKSTKHVHSIQWSRTKPQDEVKAVQLAIQTLFTNSDGNPGSRSDSSADCQWLDTLRMRQIASNTSLQRSQSNPILGSPFFSHFDGQDSYAAAVRRPQVPIKYQQITPVNQSRSSSPTQYGLTKNFSSLHLNSRDSGFSSGNTDTSSERGRYSDRSRNKYGRGSISLNSSPRGRYSGKSQHSTPSRGRYPGKFYRVSQSALNPHQSPDFKRSPRDLHQPNTIPGMPLHPETDSRASEEDSKVSEGGWTKVEYRKKPHRPSPAKTNKERARGDHRGWRNF
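Before pasting a sequence into a search box, it can help to confirm that it contains only the 20 standard one-letter amino-acid codes. The following is a minimal sketch, not part of BioStrand R&R; the helper names `clean_protein_sequence` and `composition` are hypothetical. The example input is the first 30 residues of the query sequence above, with whitespace added to show the clean-up step.

```python
from collections import Counter

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(raw: str) -> str:
    """Remove whitespace, upper-case, and reject non-standard residue codes."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("sequence is empty")
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"non-standard residue codes: {invalid}")
    return seq


def composition(seq: str) -> dict[str, float]:
    """Fraction of the sequence made up by each residue, most common first."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.most_common()}


fragment = clean_protein_sequence("MSSLGASFVQ IKFDDLQFFE\nNCGGGSFGSV")
print(len(fragment), composition(fragment))
```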

STEP 1: Sequence Search

First, we paste the selected sequence into the search box and hit enter to launch the search.

By default, BioStrand R&R gives us the power to look for every relevant association of the search sequence across all omics layers. But we can also apply various high-level filters to focus our research — for instance, solely on protein sequences and on specific protein databases.

STEP 2: Apply high-level filters to focus your search

We can focus the search on protein sequences by simply double-clicking on the AA card to deselect the others (DNA & mRNA). We can also choose which data sources we want to include in our search by clicking on the sources that we want to deselect and exclude from the search process.

In this example, we have deselected the patprot data source because it does not provide pertinent insights on functions associated with our research objectives.
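Conceptually, these high-level filters keep only results of the selected molecule type and drop results from deselected sources. The sketch below illustrates that logic on toy records; it is not the BioStrand R&R implementation or API. The `Hit` class, the `apply_filters` function, the accessions, and every source name except patprot are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    accession: str  # made-up identifier for illustration
    molecule: str   # "AA", "DNA" or "mRNA"
    source: str     # name of the data source the hit came from


def apply_filters(hits, molecules, excluded_sources):
    """Keep hits of the selected molecule types that are not from an excluded source."""
    return [
        hit for hit in hits
        if hit.molecule in molecules and hit.source not in excluded_sources
    ]


# Toy records; only "patprot" is a source name taken from the walkthrough.
toy_hits = [
    Hit("toy_1", "AA", "source_a"),
    Hit("toy_2", "AA", "patprot"),
    Hit("toy_3", "DNA", "source_a"),
    Hit("toy_4", "mRNA", "source_b"),
]

kept = apply_filters(toy_hits, molecules={"AA"}, excluded_sources={"patprot"})
print([hit.accession for hit in kept])
```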

As we can see, applying the high-level filters instantly highlights the fact that around 56% of related sequences across all databases point to MAP kinase functions (508 out of 914 total matches).
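The percentage quoted above is simple arithmetic on the match counts. As a minimal sketch (the function names `share_percent` and `function_shares` are hypothetical), the same calculation can be applied to any list of function labels exported from a result set:

```python
from collections import Counter


def share_percent(matching: int, total: int) -> float:
    """Percentage of `total` results that fall into the matching group."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * matching / total


def function_shares(labels):
    """Percentage of results carrying each function label, most common first."""
    counts = Counter(labels)
    return {label: share_percent(n, len(labels)) for label, n in counts.most_common()}


print(f"{share_percent(508, 914):.1f}%")  # 55.6%
```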

We can then proceed to identify the domains related to our query sequence.

STEP 3: Identify domains related to your query sequence

With BioStrand R&R, there are multiple approaches to validate or verify the domains related to the query sequence. In this protein example, there are two ways we can do this — either based on common HYFTs™ or sequence alignments.

1. Domain identification based on common HYFTs™

BioStrand’s proprietary HYFT™ patterns are unique signature sequences in DNA, RNA and AA that carry multiple layers of information relating to function, structure, position, etc. The more HYFTs™ the query sequence shares with a result, the more homologous the two sequences are.

When we click through to the List view, in the Ranking field we can see a list of the number of shared HYFTs™ between the query sequence and each result. From here, we can select the highest ranking or lowest ranking sequence to filter the results and check how the GO terms compare for each sequence.

To see a detailed description and associated information for a specific sequence, click on its row.
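HYFT™ patterns are proprietary, so their definition cannot be reproduced here. To illustrate the general idea of ranking results by the number of patterns they share with a query, the sketch below swaps in a plainly different and much cruder technique: counting shared fixed-length substrings (k-mers). The function names, the choice of k = 5, and the toy candidate sequences are all assumptions made for illustration.

```python
def kmers(seq: str, k: int = 5) -> set[str]:
    """All distinct substrings of length k (a stand-in for signature patterns)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_kmers(query: str, candidates: dict[str, str], k: int = 5):
    """Return (name, shared k-mer count) pairs, highest count first."""
    query_kmers = kmers(query, k)
    scores = {
        name: len(query_kmers & kmers(seq, k))
        for name, seq in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# The query is the start of the walkthrough sequence; the candidates are invented.
query = "MSSLGASFVQIKFDDLQFF"
toy_candidates = {
    "toy_hit_a": "AAASLGASFVQIKFAAA",
    "toy_hit_b": "GGGGFVQIKGGGG",
    "toy_hit_c": "PPPPPPPPPP",
}
ranking = rank_by_shared_kmers(query, toy_candidates)
print(ranking)
```

Sorting this list in ascending or descending order mirrors selecting the lowest-ranking or highest-ranking sequences in the List view.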

2. Domain identification based on sequence alignments

Here, we click through to the Alignment view, where a detailed heatmap representation of homologies appears towards the top of the page.

We can use the colour-coded heatmap to visually assess homologies with the query sequence: the sequences with the most homologies sit at the blue end of the heatmap, which gradually transitions to green as the homologies dwindle. This colour-coded representation enables us to quickly identify areas with maximum overlap, even for proteins associated with multiple domains.
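One simple way to think about such a heatmap is as a per-region identity score between the query and each aligned result. The sketch below is an assumption-laden illustration, not the method BioStrand R&R uses: it takes two sequences that are already aligned to equal length (with `-` for gaps), scores consecutive windows by the fraction of identical positions, and maps each score to one of two colour buckets. The function names, the window size, and the 0.5 threshold are hypothetical.

```python
def windowed_identity(aligned_query: str, aligned_hit: str, window: int = 10) -> list[float]:
    """Fraction of identical, non-gap positions in consecutive windows.

    Both inputs must already be aligned to the same length, using '-' for gaps.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    if window < 1:
        raise ValueError("window must be at least 1")
    scores = []
    for start in range(0, len(aligned_query), window):
        q = aligned_query[start:start + window]
        h = aligned_hit[start:start + window]
        matches = sum(1 for a, b in zip(q, h) if a == b and a != "-")
        scores.append(matches / len(q))
    return scores


def to_colour(score: float, threshold: float = 0.5) -> str:
    """Crude two-bucket colour scale: blue for high identity, green for low."""
    return "blue" if score >= threshold else "green"


scores = windowed_identity("MSSLGASFVQ", "MSSLG-SFAQ", window=5)
print([(score, to_colour(score)) for score in scores])
```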

Within the same view, we can also access more detailed sequence-related information, like related domains and functions, by simply clicking on any specific sequence.

We can also click on the amino acids and, by holding down the shift key, select part of a specific sequence to (1) copy the sequence segment, (2) directly launch a new search based on our selection, or (3) see what sequences share this pattern. With this feature, we can zoom in on specific regions and find out what domains are specifically associated with this homologue.
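The "see what sequences share this pattern" option can be pictured as an exact substring search across a set of sequences. The following is a minimal sketch of that idea only; the function name `find_pattern` and the toy sequences are hypothetical, and the platform itself may match patterns quite differently.

```python
def find_pattern(pattern: str, sequences: dict[str, str]) -> dict[str, list[int]]:
    """Map each sequence name to the 1-based start positions of the pattern.

    Overlapping occurrences are reported; sequences without a match are omitted.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    hits = {}
    for name, seq in sequences.items():
        positions = []
        start = seq.find(pattern)
        while start != -1:
            positions.append(start + 1)  # convert to 1-based residue numbering
            start = seq.find(pattern, start + 1)
        if positions:
            hits[name] = positions
    return hits


# Invented sequences for illustration only.
toy_sequences = {"toy_a": "CGGGSFGSVYGSFG", "toy_b": "AAAA"}
print(find_pattern("GSFG", toy_sequences))
```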

The BioStrand R&R protein characterisation advantage

With BioStrand R&R, we can combine all publicly available protein databases, as well as proprietary protein research data, into one unified analytical workflow. We can also focus our research on specific protein datasets that represent the most potential for our unique research objectives.

With one-click access to all relevant results, we can also quickly locate data sources that account for the maximum number of related results. And finally, using either HYFTs™ or sequence alignment techniques, we can shortlist the results based on maximum homology and then drill down further for more detailed sequence and domain information.

Originally published at https://blog.biostrand.be.
