The trials and tribulations of data in NLP

Jun 29, 2022

NLP challenges can be classified into two broad categories. The first category is linguistic and refers to the challenges of decoding the inherent complexity of human language and communication. We covered this category in our recent article, "Why Is NLP Challenging?".

The second is data-related and refers to some of the data acquisition, accuracy, and analysis issues that are specific to NLP use cases. In this article, we will look at four of the most common data-related challenges in NLP.

Low resource languages

There is currently a digital divide in NLP between high-resource languages, such as English, Mandarin, French, German, and Arabic, and low-resource languages, which include most of the remaining 7,000+ languages of the world. Though a range of ML techniques can reduce the need for labelled data, there still needs to be enough data, both labelled and unlabelled, to feed data-hungry ML techniques and to evaluate system performance.

In recent times, multilingual language models (MLLMs) have emerged as a viable option to handle multiple languages in a single model. Pretrained MLLMs have been successfully used to transfer NLP capabilities to low-resource languages. As a result, there is increasing focus on zero-shot transfer learning approaches to building bigger MLLMs that cover more languages, and on creating benchmarks to understand and evaluate the performance of these models on a wider variety of tasks.

Apart from transfer learning, a range of techniques has been developed to generate alternative forms of labelled data for low-resource languages and low-resource domains. These include data augmentation, distant and weak supervision, cross-lingual annotation projection, learning with noisy labels, and non-expert support. Today, there is even a no-code platform that allows users to build NLP models in low-resource languages.
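As an illustration of weak supervision, here is a minimal sketch in which a few hand-written rules ("labelling functions") vote on a label for each unlabelled text. The cue words, the function names, and the simple majority vote are all assumptions made for this example and are not taken from any particular library. A real project would use lexicons for the target language and usually a more careful way of combining votes.

```python
import re
from collections import Counter

ABSTAIN = None  # a labelling function returns this when it has no opinion

# Hypothetical cue words, chosen only for illustration.
POSITIVE_CUES = {"good", "great", "excellent"}
NEGATIVE_CUES = {"bad", "poor", "terrible"}


def tokens(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def lf_positive(text):
    """Vote 'pos' if any positive cue word appears."""
    return "pos" if tokens(text) & POSITIVE_CUES else ABSTAIN


def lf_negative(text):
    """Vote 'neg' if any negative cue word appears."""
    return "neg" if tokens(text) & NEGATIVE_CUES else ABSTAIN


LABELLING_FUNCTIONS = [lf_positive, lf_negative]


def weak_label(text):
    """Combine the votes by simple majority; abstain on ties or no votes."""
    votes = [lf(text) for lf in LABELLING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence, leave the text unlabelled
    return ranked[0][0]


print(weak_label("The service was great!"))  # pos
print(weak_label("good but bad"))            # None (tie)
```

Texts on which the rules abstain or disagree stay unlabelled, so the resulting training set is noisy but cheap to produce.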

Training Data

Building accurate NLP models requires huge volumes of training data. Though the number of NLP datasets has increased sharply in recent times, these are often collected through automation or crowdsourcing. There is, therefore, the potential for incorrectly labelled data which, when used for training, can lead to memorisation and poor generalisation. Apart from finding enough raw data for training, the key challenge is to ensure accurate and extensive data annotation to make training data more reliable.

Data annotation broadly refers to the process of organising and annotating training data for specific NLP use cases. In text annotation, a subset of data annotation, text data is transcribed and annotated so that ML algorithms are able to make associations between actual and intended meanings.

There are five main techniques for text annotation: sentiment annotation, intent annotation, semantic annotation, entity annotation, and linguistic annotation. However, there are several challenges that each of these has to address. For instance, data labelling for entity annotations typically has to contend with issues related to nesting annotations, introducing new entity types in the middle of a project, managing extensive lists of tags, and categorising trailing and preceding whitespaces and punctuation.
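To make the whitespace and punctuation issue concrete, here is a minimal sketch that tightens an entity span so that it does not start or end on whitespace or punctuation. It assumes that annotations are stored as character offsets into the text, with the end offset exclusive. The function name `trim_span` is hypothetical.

```python
import string

# Characters we do not want at either edge of an entity span.
TRIM_CHARS = string.whitespace + string.punctuation


def trim_span(text, start, end):
    """Shrink the span [start, end) until neither edge is whitespace or punctuation."""
    while start < end and text[start] in TRIM_CHARS:
        start += 1
    while end > start and text[end - 1] in TRIM_CHARS:
        end -= 1
    return start, end


text = "Visit Paris, France."
start, end = trim_span(text, 5, 12)  # the annotator selected " Paris,"
print(text[start:end])               # Paris
```

Blanket trimming is a design choice rather than a rule. It would, for example, also strip the final full stop from an entity such as "U.S.", so annotation guidelines need to say which behaviour is wanted.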

Currently, there are several annotation and classification tools for managing NLP training data at scale. However, manually labelled gold-standard annotations remain a prerequisite. Though ML models are increasingly capable of automated labelling, human annotation is still essential in cases where data cannot be auto-labelled with high confidence.
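One common way to combine automated and human labelling is to route items by model confidence. The sketch below assumes that some model has already produced a label and a confidence score between 0 and 1 for each text. The function name, the tuple layout, and the 0.9 threshold are illustrative assumptions, not recommendations.

```python
def route_by_confidence(predictions, threshold=0.9):
    """Split (text, label, confidence) triples into auto-accepted labels and
    texts that should go to a human annotator."""
    auto_labelled, needs_review = [], []
    for text, label, confidence in predictions:
        if confidence >= threshold:
            auto_labelled.append((text, label))
        else:
            needs_review.append(text)
    return auto_labelled, needs_review


# Made-up predictions, purely for illustration.
predictions = [
    ("Aspirin reduces fever", "DRUG_EFFECT", 0.97),
    ("The cell was cold", "OTHER", 0.55),
]
auto_labelled, needs_review = route_by_confidence(predictions)
print(auto_labelled)  # [('Aspirin reduces fever', 'DRUG_EFFECT')]
print(needs_review)   # ['The cell was cold']
```

The threshold trades annotation cost against label noise, and it only works as intended if the confidence scores of the model are reasonably well calibrated.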

Large or multiple documents

Dealing with large or multiple documents is another significant challenge facing NLP models. There are two problems. First, most NLP research benchmarks models on short-text tasks, and even state-of-the-art models limit the number of words allowed in the input text. Second, supervision is scarce and expensive to obtain. As a result, scaling up NLP to extract context from huge volumes of medium to long unstructured documents remains a technical challenge.
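A simple workaround for the input limit is to split a long document into overlapping windows and process each window separately. The sketch below is a minimal illustration under stated assumptions: it counts whitespace-separated words, whereas real models usually count subword tokens, and the default sizes of 512 and 64 are placeholders rather than the limits of any specific model.

```python
def chunk_words(text, max_words=512, overlap=64):
    """Split text into chunks of at most max_words words, where consecutive
    chunks share `overlap` words so that context is not cut abruptly."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be at least 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the document
    return chunks


document = " ".join(str(i) for i in range(10))
print(chunk_words(document, max_words=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Chunking keeps each input within the limit, but anything that depends on context spanning several chunks still has to be recombined afterwards, which is part of why long documents remain hard.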

Many current NLP models are based on recurrent neural networks (RNNs), which struggle to represent longer contexts. However, there is a lot of focus on graph-inspired RNNs as it emerges that a graph structure may serve as the best representation of NLP data. Research at the intersection of deep learning (DL), graphs and NLP is driving the development of graph neural networks (GNNs). Today, GNNs have been applied successfully to a variety of NLP tasks, from classification tasks such as sentence classification, semantic role labelling and relation extraction, to generation tasks like machine translation, question generation, and summarisation.
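To give a sense of what a graph representation of text can look like, here is a minimal sketch that builds a word co-occurrence graph: each distinct word is a node, and two words are linked if they appear within a few positions of each other. This is only one of many possible constructions and is not a GNN in itself. The function name and the window size are assumptions made for this example.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Return an undirected graph, as a dict of neighbour sets, linking
    words that occur within `window` positions of each other."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            neighbour = tokens[j]
            if neighbour != word:  # no self-loops
                graph[word].add(neighbour)
                graph[neighbour].add(word)
    return dict(graph)


tokens = "graphs represent text as nodes and edges".split()
graph = cooccurrence_graph(tokens, window=1)
print(sorted(graph["represent"]))  # ['graphs', 'text']
```

A graph like this, or one built from a dependency parse, could then serve as the input structure that a GNN passes messages over.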

Development time and resources

As we mentioned in our previous article regarding the linguistic challenges of NLP, AI programs like AlphaGo have evolved quickly to master a broader variety of games with less predefined knowledge. But NLP development cycles are yet to see that pace and degree of evolution.

That’s because human language is inherently complex as it makes “infinite use of finite means” by enabling the generation of an infinite number of possibilities from a finite set of building blocks. The prevailing syntax of every language is the result of communicative needs and evolutionary processes that have developed over thousands of years. As a result, NLP development is a complex and time-consuming process that requires evaluating billions of data points in order to adequately train AI from scratch.

Meanwhile, the complexity of large language models is doubling every two months. A powerful language model like GPT-3 packs 175 billion parameters and requires 314 zettaflops of compute to train, where a zettaflop here means 10^21 floating-point operations. It has been estimated that it would cost nearly $100 million in deep learning (DL) infrastructure to train the world’s largest and most powerful generative language model with 530 billion parameters. In 2021, Google open-sourced a 1.6 trillion parameter model and the projected parameter count for GPT-4 is about 100 trillion. As a result, language modelling is quickly becoming as economically challenging as it is conceptually complex.
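The GPT-3 compute figure can be roughly reproduced with a back-of-the-envelope calculation. The sketch below uses two assumptions that are not stated in this article: the widely used rule of thumb that training a dense transformer takes about 6 floating-point operations per parameter per training token, and a training set of roughly 300 billion tokens as reported for GPT-3.

```python
# Assumptions (not from this article): ~6 FLOPs per parameter per token,
# and ~300 billion training tokens for GPT-3.
FLOPS_PER_PARAM_PER_TOKEN = 6
ZETTA = 10**21

parameters = 175 * 10**9       # 175 billion parameters
training_tokens = 300 * 10**9  # assumed number of training tokens

total_flops = FLOPS_PER_PARAM_PER_TOKEN * parameters * training_tokens
zettaflops = total_flops / ZETTA

print(f"{total_flops:.2e} floating-point operations")  # 3.15e+23 floating-point operations
print(f"about {zettaflops:.0f} zettaflops")            # about 315 zettaflops
```

The estimate of about 315 zettaflops sits very close to the 314 quoted above, which suggests that the quoted figure is a count of total operations rather than a rate per second.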

Scaling NLP

NLP continues to be one of the fastest-growing sectors within AI. As the race to build larger transformer models continues, the focus will turn to cost-effective and efficient means to continuously pre-train gigantic generic language models with proprietary domain-specific data. Even though large language models and computational graphs can help address some of the data-related challenges of NLP, they will also require infrastructure on a whole new scale. Today, vendors like NVIDIA are offering fully packaged products that enable organisations with extensive NLP expertise but limited expertise in systems, HPC, or large-scale NLP workloads to scale out faster. So, despite the challenges, NLP continues to expand and grow to include more and more new use cases.


Originally published at https://blog.biostrand.ai.