Attention mechanisms, transformers and NLP

BioStrand (a subsidiary of IPA)
5 min read · Jul 11, 2022


Natural Language Processing (NLP) is a multidisciplinary field, and over the years several models and algorithms have been successfully used to parse text. Machine learning approaches have been central to NLP development, with many of them focusing on a technique called sequence-to-sequence learning (Seq2seq).

Deep Neural Networks

First introduced by Google in 2014, Seq2seq models revolutionized translation and were quickly adopted for a variety of NLP tasks, including text summarization, speech recognition, image captioning and question answering. Prior to this, Deep Neural Networks (DNNs) had been used to tackle difficult problems such as speech recognition.

However, they suffered from a significant limitation: they required the dimensionality of inputs and outputs to be known and fixed. Hence, they were not suitable for sequential problems, such as speech recognition, machine translation and question answering, where dimensionality cannot be pre-defined.

As a result, recurrent neural networks (RNNs), a type of artificial neural network, soon became the state of the art for sequential data.

Recurrent Neural Networks

In a traditional DNN, the assumption is that inputs and outputs are independent of each other. RNNs, however, operate on the principle that the output depends on both the current input as well as the “memory” of previous inputs from a sequence. The use of feedback loops to process sequential data allows information to persist thereby giving RNNs their “memory.” As a result, this approach is perfectly suitable for language applications where context is vital to the accuracy of the final output.
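The recurrent "memory" can be sketched in a few lines of NumPy; the dimensions and weight values below are toy illustrations, not taken from any particular model:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: the new hidden state mixes the current
    input with the 'memory' carried in the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Tiny example: a 3-step sequence of 4-dim inputs, 5-dim hidden state.
rng = np.random.default_rng(0)
d, h = 4, 5
W_xh = rng.normal(size=(d, h)) * 0.1
W_hh = rng.normal(size=(h, h)) * 0.1
b_h = np.zeros(h)

h_t = np.zeros(h)                      # empty memory at t = 0
for x_t in rng.normal(size=(3, d)):    # feed the sequence step by step
    h_t = rnn_step(x_t, h_t, W_xh, W_hh, b_h)

print(h_t.shape)  # the final state summarises the whole sequence
```

The same `rnn_step` is applied at every position, so the network's size does not depend on the sequence length, which is exactly what makes RNNs suitable for variable-length input.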

However, RNNs suffered from the problem of vanishing gradients: gradients shrink as they are propagated back through long sequences, so the network effectively focuses only on the most recent information and loses what came earlier, which impaired meaningful learning on long data sequences. RNNs soon evolved into several specialized variants, like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells, along with related constructs such as time-distributed and ConvLSTM layers, with the capability to process long sequences.
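The vanishing-gradient effect can be seen numerically in a toy tanh RNN: the gradient that reaches early time steps is a product of per-step Jacobians, and with a modest weight scale (an illustrative assumption here) that product collapses toward zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
W = rng.normal(size=(n, n)) * 0.1      # small recurrent weights (toy values)

h = rng.normal(size=n)
grad = np.eye(n)                       # d h_T / d h_0, built up step by step
norms = []
for _ in range(30):
    h = np.tanh(W @ h)
    # Chain rule through one step: Jacobian of tanh(W h) is diag(1 - h^2) W.
    grad = np.diag(1 - h**2) @ W @ grad
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the gradient norm shrinks by orders of magnitude
```

Gating mechanisms in LSTMs and GRUs mitigate exactly this: they give gradients an additive path through the cell state instead of a purely multiplicative one.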

Each of these variants was designed to address specific situations: GRUs outperformed LSTMs on low-complexity sequences, consumed less memory and delivered faster results, whereas LSTMs performed better on high-complexity sequences and enabled higher accuracy.
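The memory difference follows directly from the parameter counts: an LSTM cell has four gate/candidate blocks where a GRU has three, so at equal dimensions a GRU carries 25% fewer recurrent parameters. The dimensions below are arbitrary examples:

```python
def rnn_cell_params(d, h, blocks):
    """Parameters for `blocks` gate/candidate blocks, each mapping the
    concatenated [input; hidden] vector to a hidden-sized output,
    plus one bias vector per block."""
    return blocks * (h * (d + h) + h)

d, h = 128, 256                         # toy input and hidden sizes
lstm = rnn_cell_params(d, h, blocks=4)  # input, forget, output gates + cell candidate
gru = rnn_cell_params(d, h, blocks=3)   # update, reset gates + candidate

print(lstm, gru, gru / lstm)            # GRU: exactly 3/4 of the LSTM's parameters
```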

RNNs and their variants soon became state-of-the-art for sequence translation. However, there were still several limitations related to long-term dependencies, parallelization, resource intensity and their inability to take full advantage of emerging computing hardware such as TPUs and GPUs. However, a new model would soon emerge and go on to become the dominant architecture for complex NLP tasks.


By 2017, complex RNNs and variants became the standard for sequence modelling and transduction with the best models incorporating an encoder and decoder connected through an attention mechanism. That year, however, a paper from Google called Attention Is All You Need proposed a new model architecture called the Transformer based entirely on attention mechanisms.

Having dropped recurrence in favour of attention mechanisms, these models performed remarkably better at translation tasks, while enabling significantly more parallelization and requiring less time to train.

What is the attention mechanism?

The concept of attention mechanism was first introduced in a 2014 paper on neural machine translation. Prior to this, RNN encoder-decoder frameworks encoded variable-length source sentences into fixed-length vectors that would then be decoded into variable-length target sentences. This approach not only restricts the network’s ability to cope with large sentences but also results in performance deterioration for long input sentences.

Rather than trying to force-fit all the information from an input sentence into a fixed-length vector, the paper proposed the implementation of a mechanism of attention in the decoder. In this approach, the information from an input sentence is encoded across a sequence of vectors, instead of a fixed-length vector, with the attention mechanism allowing the decoder to adaptively choose a subset of these vectors to decode the translation.
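A minimal sketch of this idea, loosely following the additive (Bahdanau-style) scoring function with randomly initialized toy weights, shows the decoder state producing one weight per encoder state and reading out a context vector:

```python
import numpy as np

def additive_attention(query, keys, W_q, W_k, v):
    """Bahdanau-style scoring: score(q, k) = v . tanh(W_q q + W_k k).
    Softmax over the scores decides how much each encoder state
    contributes to the context vector the decoder reads at this step."""
    scores = np.array([v @ np.tanh(W_q @ query + W_k @ k) for k in keys])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ keys           # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(2)
d = 6                                  # hidden size (toy value)
keys = rng.normal(size=(4, d))         # 4 encoder states, one per source token
query = rng.normal(size=d)             # current decoder state
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

context, weights = additive_attention(query, keys, W_q, W_k, v)
print(weights)   # one weight per source position, summing to 1
```

Because a fresh set of weights is computed at every decoding step, the decoder can look at different parts of the source sentence for different target words, instead of relying on a single fixed-length summary.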

Types of attention mechanisms

The Transformer was the first transduction model to implement self-attention as an alternative to recurrence and convolutions. A self-attention, or intra-attention, mechanism relates different positions of a single sequence in order to compute a representation of that sequence. And depending on the implementation, there can be several types of attention mechanisms.

For instance, in terms of which source states contribute to deriving the attention vector, there is global attention, where attention is placed on all source states; hard attention, where it is placed on just one source state; and soft attention, where it is placed on a limited set of source states.

There is also Luong attention from 2015, a variation on the original Bahdanau, or additive, attention, which combined two classes of mechanisms, a global one attending to all source words and a local one focused on a selected subset of words, to predict the target sentence.

The 2017 Google paper introduced scaled dot-product attention, which is similar to dot-product, or multiplicative, attention but adds a scaling factor. The same paper also defined multi-head attention, where instead of performing a single attention function, several are performed in parallel. This approach enables the model to concurrently attend to information from different representation subspaces at different positions.
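The two mechanisms can be sketched together in NumPy. This is a simplified illustration: the per-head projection matrices of the original paper are omitted, and the dimensions are toy values:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the 1/sqrt(d_k) factor keeps the
    dot products from growing with the key dimension, which would push
    the softmax into regions with tiny gradients."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, heads):
    """Run attention once per head on a slice of the model dimension,
    then concatenate: each head can attend to a different subspace."""
    splits = np.split(X, heads, axis=-1)
    return np.concatenate(
        [scaled_dot_product_attention(s, s, s) for s in splits], axis=-1)

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))            # 5 tokens, model dimension 8 (toy values)
out = multi_head_attention(X, heads=2)
print(out.shape)  # (5, 8): same shape, information mixed across positions
```

Every token attends to every other token in one matrix multiplication, which is what makes the computation parallelizable across the whole sequence, unlike the step-by-step recurrence of an RNN.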

Multi-head attention has played a central role in the success of Transformer models, demonstrating consistent performance improvements over other attention mechanisms. In fact, RNNs that would typically underperform Transformers have been shown to outperform them when using multi-head attention. Apart from RNNs, multi-head attention has also been incorporated into other models like Graph Attention Networks and Convolutional Neural Networks.

Transformers in NLP

The Transformer architecture has become a dominant choice in NLP. In fact, some of the leading language models for NLP, such as Bidirectional Encoder Representations from Transformers (BERT), the Generative Pre-trained Transformer family (GPT-3), and XLNet, are transformer-based.

Transformer-based pretrained language models (T-PTLMs) have been successfully used in a variety of NLP tasks. Built on transformers, self-supervised learning and transfer learning, T-PTLMs use self-supervised learning on large volumes of text data to learn universal language representations and then transfer this knowledge to downstream tasks.

Today, there is a long list of T-PTLMs, including general, social media, monolingual, multilingual and domain-specific models. Specialized biomedical language models, like BioBERT, BioELECTRA, BioALBERT and BioELMo, have been able to produce meaningful concept representations that augment the power and accuracy of a range of bioNLP applications such as named entity recognition, relationship extraction and question answering.

Transformer-based language models trained with large-scale drug-target interaction (DTI) data sets have been able to outperform conventional methods in the prediction of novel drug-target interactions.

It is hard to tell whether Transformers will eventually replace RNNs entirely, but they are currently the model of choice for NLP.

