NLP — Are you up to the challenge?
A 2021 Natural Language Processing (NLP) industry survey found a significant year-on-year increase in NLP budgets across businesses from different industries, company sizes, geographic locations and stages of NLP adoption.
According to the study, the top four industries leading NLP adoption were healthcare, technology, education, and financial services. There were, however, some minor variations in primary use case priorities.
For instance, while healthcare’s priorities were entity linking/knowledge graphs and de-identification, tech companies more often cited named entity recognition (NER) and document classification. Despite the increase in NLP adoption and investments, accuracy continued to be the top concern and the most important requirement when evaluating an NLP solution.
Why is accuracy in NLP so confounding, especially when AI programs have mastered complex games like Go, evolved into general-purpose algorithms that can master multiple games with even less predefined knowledge, and moved on to mastering real-world tasks?
Key NLP challenges in mastering language complexity
Language complexity continues to be a popular research area, though there still seems to be no consensus on the best metrics for measuring it. Over the last two decades, a number of different complexity measures have emerged, either at specific linguistic levels (morphology, syntax, or phonology) or at a broader structural level.
Recent research has shown some progress in quantifying the complexity of different grammatical components of human languages. For instance, there seems to be a correlation between different types of grammatical complexity and the number of speakers in a particular language.
But at the same time, it has also emerged that the process of normal transmission, by which children learn their first language, not only passes along but actually expands linguistic complexity from generation to generation.
It is therefore quite easy to appreciate the challenges facing NLP, despite its constantly evolving sophistication, in processing human language. Here are some conceptual linguistic obstacles that NLP has to tackle on the route to fool-proof accuracy.
NLP language models fundamentally work by learning the relationships between words and converting them into a computer-interpretable format. However, the same words and phrases can take on different meanings based on the context within which they are presented.
For instance, the phrases “on thin ice” and “walk in the park” could refer to two completely dissimilar contexts that are easily distinguished by humans. But even though NLP models may have learned all the word definitions, differentiating between them in context can still pose challenges.
Differentiating between homonymy, synonymy, and polysemy
Consider homonyms (words similar in pronunciation or spelling but with different meanings), synonyms (completely different words with similar meanings), and polysemous words (the same word with multiple meanings). Homonyms can be further divided into homophones (pray/prey) and homographs (bass: a low, deep sound OR a type of fish).
Then there are some words that are both polysemous and homonymous. Though it is natural for us to parse each of these words based on context, current language models may still find it challenging to extract meaning from such nuances.
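To make the homograph problem concrete, here is a minimal sketch of overlap-based disambiguation, loosely in the spirit of the simplified Lesk approach. The sense inventory, the stopword list and the function names are hypothetical and hand-written purely for illustration; production systems rely on far larger lexical resources or on contextual language models.

```python
# Toy sense inventory for the homograph "bass" (hypothetical glosses,
# written by hand for this example only).
SENSES = {
    "bass": {
        "sound": "low deep sound tone music voice instrument",
        "fish": "type of fish caught in lake river sea water",
    },
}

# A tiny stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "in", "of", "on", "and", "to", "is", "was"}


def tokenize(text):
    """Lowercase, strip basic punctuation, and drop stopwords."""
    return {w.strip(".,!?;:").lower() for w in text.split()} - STOPWORDS


def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most words with the sentence."""
    context = tokenize(sentence) - {word}
    scores = {
        sense: len(context & tokenize(gloss))
        for sense, gloss in SENSES[word].items()
    }
    # On a tie, max() returns the first sense listed in the inventory.
    return max(scores, key=scores.get), scores


print(disambiguate("bass", "He caught a huge bass in the lake"))
print(disambiguate("bass", "The bass line in this music is too low"))
```

The approach only works when the sentence happens to share vocabulary with a gloss; a sentence such as "That bass was enormous" gives it nothing to go on, which is exactly the kind of nuance the paragraph above describes.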
Detecting irony and sarcasm
Consider the linguistics professor proclaiming to his class that there wasn’t a single language in which a double positive expressed a negative, only to be met by an irreverent “Yeah, right.” from the backbenches. So was that irony, sarcasm or satire?
The challenge here is that each of those concepts carries an “opposite” connotation that is normally also tempered by humour. Detecting the tone, therefore, becomes more important than interpreting the logical meaning of words and sentences. There is a natural language ambiguity inherent in irony, sarcasm, puns, and jokes, where the tone imputes a meaning that is quite opposite to what the sentence literally says.
Most words, even seemingly unambiguous ones, are ambiguous in the sense that their meaning can often be context-dependent. Then there is lexical ambiguity, where words can have multiple meanings. This is followed by syntactic or structural ambiguity, which affects the process of parsing to determine the hierarchical structure behind the linear sequence of words.
Even after lexical and structural ambiguities are clarified, there can still be semantic ambiguity in terms of how a sentence can be interpreted. Over and above all this, there can also be anaphoric ambiguity where a phrase or word could be referring back to one of several articles or objects mentioned earlier in the sentence.
As a result, ambiguity is one of the biggest challenges in NLP as there is a range of factors and variables that influence and determine meaning.
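The structural ambiguity mentioned above can be illustrated with a small sketch that enumerates every way of grouping a word sequence into a binary tree. The function name `bracketings` and the example phrase are assumptions chosen for illustration. The sketch deliberately ignores grammar, so it over-generates: a real parser would discard most of these groupings, but more than one can still survive.

```python
def bracketings(tokens):
    """Return every binary grouping of a token sequence as a bracketed string."""
    if len(tokens) == 1:
        return [tokens[0]]
    results = []
    # Try every split point, then combine all groupings of each side.
    for i in range(1, len(tokens)):
        for left in bracketings(tokens[:i]):
            for right in bracketings(tokens[i:]):
                results.append(f"({left} {right})")
    return results


phrase = ["old", "men", "and", "women"]
for tree in bracketings(phrase):
    print(tree)
print(len(bracketings(phrase)), "candidate structures")
```

For this phrase, the candidate groupings include one that pairs "old" with "men" first and one that pairs "men and women" first; these correspond to the two readings a human would recognise (only the men are old, or both groups are). The number of candidate structures grows quickly with sentence length, which is part of why parsing is hard.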
Understanding language- and domain-specific vocabularies and rules
Every industry has its own vocabulary and linguistic variation, which differ not only from those of other industries but also from general text. For instance, it has been shown that lexical ambiguity in medical texts is quantitatively and qualitatively distinct from that of general texts. Much of the information about these distinctions between domain-specific vocabularies often cannot be explicitly defined.
Then there is the challenge of dealing with over 6,500 languages, each with its own linguistic rules. As a result, we are still in the process of developing language models specific to multiple languages and domains, and are still years away from a general-purpose NLP that can scale across all linguistic rules, variations, ambiguities and complexities.
The future of NLP
A vast majority of the data generated today is in the form of unstructured text that falls outside the processing capabilities of conventional data analytics tools and frameworks. In a data-centric world, that is a lot of latent data and, therefore, unrealized opportunities and innovations.
NLP technologies will be critical to extracting value from large volumes of unstructured data. In biomedical research, for example, NLP can automate and accelerate the integration and analysis of statistical and biological information from large volumes of text including scientific literature and medical/clinical data.
Today, NLP is one of the fastest-growing sectors within AI/ML. Though there are some challenges to be addressed, NLP research continues to progress rapidly across multiple impactful areas.
Originally published at https://blog.biostrand.be.