Natural Language Processing (NLP) is growing in use and plays a vital role in many systems from resume-parsing for hiring to automated telephone services. You can also find it in commonly used technology such as chatbots, virtual assistants, and modern spam detection. However, the development and implementation of NLP technology is not as equitable as it may appear.
To put it into perspective, although there are over 7000 languages spoken around the world, the vast majority of NLP processes amplify seven key languages: English, Chinese, Urdu, Farsi, Arabic, French, and Spanish.
Even among these seven languages, the vast majority of technological advances have been achieved in English-based NLP systems. For example, optical character recognition (OCR) is still limited for non-English languages. And anyone who has used an online automatic translation service knows the severe limitations once you venture beyond the key languages referenced above.
How are NLP Pipelines Developed?
To understand the language disparity in NLP, it is helpful to first understand how these systems are developed. A typical pipeline starts by gathering and labeling data. A large dataset is essential here as data will be needed to both train and test the algorithm.
When a pipeline is developed for a language with little available data, it is helpful to have strong patterns within the language. Small datasets can be augmented by techniques such as synonym replacement to simplify the language, back-translation to create similarly phrased sentences to bulk up the dataset, and replacing words with other related parts of speech.
Language data also requires significant cleaning. When a non-English language with special characters is used, like Chinese, proper unicode normalization is typically required. This allows the text to be converted to a binary form that is recognizable by all computer systems, reducing the risk of processing errors. This issue is amplified for languages like Croatian, that rely heavily on accentuation to change the meaning of a word. For example, in Croatian, a single accent can change a positive word into a negative one. Therefore, these terms have to be manually encoded to ensure a robust dataset.
Finally, the dataset can be divided into a training and testing split, and sent through the machine learning process of feature engineering, modeling, evaluation, and refinement.
One commonly used NLP tool is Google’s Bidirectional Encoder Representations from Transformers (BERT) which is purported to develop a “state of the art” model in 30 minutes using a single tensor processing unit. Their GitHub page reports that the top 100 languages with the largest Wikipedia databases are supported, but the actual evaluation and refinement of the system has only been performed on 15 languages. While BERT technically does support more languages, the lower level of accuracy and lack of proper testing limits the applicability of this technology. Other NLP systems such as Word2Vec and the Natural Language Toolkit (NLTK) have similar limitations.
In summary, the NLP pipeline is a challenge for less popular languages. The datasets are smaller, they often need augmentation work, and the cleaning process requires time and effort. The less access you have to native-language resources, the less data available when building an NLP pipeline. This makes the barrier to entry for less popular languages very high, and in many cases, too high.
The Importance of Diverse Language Support in NLP
There are three overarching perspectives that support the expansion of NLP:
- The reinforcement of social disadvantages
- Normative biases
- Language expansion to improve ML technology
Let’s look at each in more detail:
Reinforcement of Social Disadvantages
From a societal perspective, it is important to keep in mind that technology is only accessible when its tools are available in your language. On a basic level, the lack of spell-check technology impairs those who speak and write less common languages. This disparity rises up the technological chain.
What’s more, psychological research has shown that the language you speak molds the way you think. An in-built language preference in the systems that drive the internet inherently incorporates the societal norms of the driving languages.
The fact is that supported systems continue to thrive while it is challenging to introduce new aspects to a deeply ingrained program. This means that as NLP continues to develop without bringing in a diverse range of languages, it will be more challenging to incorporate them in the future, endangering the global variety of languages.
English and English-adjancent languages are not representative of other world languages as they have unique grammatical structures that many languages do not share. By supporting mostly English languages, however, the Internet and other technologies are progressively treating English as the normal, default language setting.
As a relatively agnostic system is trained on English, it learns the norms and systems of a specific language and all the cultural implications that come with that limitation. This single sided approach will only continue to become more apparent as NLP is applied to more intelligent processes that have an international audience.
Language Expansion to Improve ML Technology
When we apply machine learning techniques to only a handful of languages, we program implicit bias into the systems. As machine learning and NLP continues to advance while only supporting a few languages, we not only make it more challenging to introduce new languages, but run the risk of making it fundamentally impossible to do so.
For example, subword tokenization implementation performs very poorly on languages that feature reduplication, a feature common to many international languages such as Afrikaans, Irish, Punjabi, and Armenian.
Languages also have a variety of word order norms, which tend to stump the common neural models used in English-based NLP.
What Can Be Done?
In the current discourse around NLP, when the words “natural language” are spoken, the general assumption is that the researcher is working on an English database. To break out of this mold and create more awareness for international systems, we should first always refer to the language system being developed. This idea of always stating the language a researcher is working on is colloquially referred to as the Bender Rule.
Simple awareness of the issue, of course, is not enough. However, being mindful of the issue can help in the development of more broadly applicable tools.
When looking to introduce more languages into an NLP pipeline, it is also important to consider the size of the dataset. If you are creating a new dataset, a significant portion of your budget should be applied to creating a dataset in another language. Of course, additional research in optimizing current cleaning and annotation programs in other languages is also vital to broadening NLP technologies around the globe.
Be sure to check the related resources below for more technical articles, and sign up to the Lionbridge AI newsletter for interviews and articles delivered directly to your inbox.
Subscribe to our newsletter for more technical articles
Mehrnaz is a passionate data analyst who loves applying technical skills to the health sector. She is a certified health coach and public health researcher who is constantly looking for innovative ways to apply ML. She writes about data science on Medium.