What are transformers?
Transformers are a relatively new type of neural network designed to process sequences while handling long-range dependencies with ease. They are the most advanced natural language processing (NLP) technique to date.
They can translate text, write poetry and articles, and even generate computer code. Unlike recurrent neural networks (RNNs), transformers do not process sequences in order. If the input data is text, for example, they do not need to finish processing the beginning of a sentence before starting on its end. Thanks to this, such a network can be parallelized and trained much faster.
When did they appear?
Transformers were first described by engineers from Google Brain in the 2017 paper “Attention Is All You Need”.
One of the main differences from earlier methods is that the input sequence can be processed in parallel, so GPUs can be used efficiently and training speed increases.
Why do we need transformers?
Until 2017, engineers relied on recurrent neural networks to understand text with deep learning.
For example, when translating a sentence from English into Russian, an RNN takes the English sentence as input, processes the words one at a time, and then outputs their Russian counterparts in order. The key word here is “sequentially”: in language, word order matters, and you cannot simply shuffle words around.
This is where RNNs run into a number of problems. First, they struggle to process long sequences of text: by the time they reach the end of a paragraph, they have “forgotten” its beginning. For example, an RNN-based translation model may have trouble remembering the gender of a subject mentioned at the start of a long text.
Second, RNNs are difficult to train. As is well known, they are prone to the so-called vanishing/exploding gradient problem.
Third, because they process words sequentially, a recurrent neural network is hard to parallelize. This means training cannot be sped up by adding more GPUs, and the network therefore cannot be trained on large amounts of data.
How do transformers work?
The main components of transformers are an encoder and a decoder.
The encoder takes the input information (for example, text) and converts it into a vector (a set of numbers). The decoder, in turn, decodes that vector into a new sequence (for example, the answer to a question, or words in another language), depending on the purpose the network was built for.
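To make this division of labor concrete, here is a purely illustrative sketch of the encoder-decoder interface. The toy embedding table, the averaging encoder and the similarity-based decoder are all stand-ins; in a real transformer both halves are deep, learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table (illustrative only; a real model
# learns its embeddings during training).
vocab = {"red": 0, "fox": 1, "jumps": 2, "dog": 3}
embed = rng.normal(size=(len(vocab), 8))

def encode(tokens):
    """Encoder: turn the input sequence into a fixed-size vector.
    Here it is just the average of toy word embeddings."""
    return embed[[vocab[t] for t in tokens]].mean(axis=0)

def decode(state, out_vocab):
    """Decoder: turn the vector back into an output token by picking
    the candidate whose embedding is most similar to the state."""
    scores = {word: float(embed[i] @ state) for word, i in out_vocab.items()}
    return max(scores, key=scores.get)

state = encode(["red", "fox"])   # vector summary of the input
word = decode(state, vocab)      # most plausible output token
```

A real decoder also generates its output one token at a time, feeding each generated token back in; this sketch only shows the shape of the interface.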
The other innovations at the heart of transformers boil down to three core concepts:
- positional encodings;
- attention;
- self-attention.
Let’s start with the first one, positional encodings. Suppose you need to translate a text from English into Russian. Standard RNN models “understand” word order by processing the words sequentially, but this makes the process hard to parallelize.
Positional encodings overcome this barrier. The idea is to take every word in the input sequence (here, an English sentence) and attach a number indicating its position. You then “feed” the network the following sequence:
[(“Red”, 1), (“fox”, 2), (“jumps”, 3), (“over”, 4), (“lazy”, 5), (“dog”, 6)]
Conceptually, this can be seen as shifting the burden of understanding word order from the structure of the neural network to the data itself.
At first, before a transformer has been trained on any data, it doesn’t know how to interpret these positional encodings. But as the model sees more and more examples of sentences and their encodings, it learns to use them effectively.
The scheme presented above is oversimplified: the authors of the original paper used sinusoidal functions to construct the positional encodings, not simple integers like 1, 2, 3, 4, but the essence is the same. By storing word order in the data rather than in the network’s structure, the neural network becomes easier to train.
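As a concrete illustration, the sinusoidal scheme from the paper can be written in a few lines of NumPy (the sequence length and embedding dimension below are chosen arbitrarily for the example):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from "Attention Is All You Need".

    Each position pos gets a d_model-dimensional vector:
      PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
      PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    positions = np.arange(seq_len)[:, np.newaxis]              # (seq_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)  # even dimensions
    pe[:, 1::2] = np.cos(positions / div_terms)  # odd dimensions
    return pe

pe = positional_encoding(seq_len=6, d_model=8)  # one row per word of the sentence
```

Unlike plain integers, these values stay bounded between -1 and 1 no matter how long the sequence gets, and each position receives a unique, smoothly varying pattern.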
Attention is a neural network mechanism introduced in the context of machine translation in 2015. To understand the concept, let’s turn to the original paper.
Let’s imagine that we need to translate a phrase into French:
“The agreement on the European Economic Area was signed in August 1992”.
The French equivalent of the expression is:
“L’accord sur la zone économique européenne a été signé en août 1992”.
The worst possible approach would be to translate word by word, looking up the French analog of each English word one at a time. That doesn’t work, for several reasons.
First, some words in the French translation appear in a different order:
“European Economic Area” versus “la zone économique européenne”…
Second, French is a gendered language. To agree with the feminine noun “la zone”, the adjectives “économique” and “européenne” must also be put in the feminine form.
Attention helps avoid these problems. The mechanism allows the model to “look” at every word in the original sentence when deciding how to translate each word of the output. This is demonstrated by a visualization from the original paper:
It is a kind of heat map showing what the model “pays attention to” as it translates each word of the French sentence. As you would expect, when the model outputs the word “européenne”, it attends heavily to both input words “European” and “Economic”…
Training data helps the model learn which words to “pay attention” to at each step. By observing thousands of English and French sentences, the algorithm learns which types of words depend on one another: it learns to account for gender, plurality and other grammar rules.
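The alignment weights behind such a heat map can be sketched as a softmax over similarity scores between target and source word vectors. In the sketch below the embeddings are random placeholders; in a real model they are learned, which is exactly what makes the resulting attention pattern meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)

src_words = ["European", "Economic", "Area"]
tgt_words = ["zone", "économique", "européenne"]

# Placeholder embeddings; a real model learns these during training.
src_vecs = rng.normal(size=(len(src_words), 16))
tgt_vecs = rng.normal(size=(len(tgt_words), 16))

def softmax(x):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One row per target word: how much it "attends" to each source word.
# This matrix is exactly what the heat map visualizes.
attention = softmax(tgt_vecs @ src_vecs.T)  # shape (3, 3), each row sums to 1
```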
The attention mechanism has been an extremely useful tool for natural language processing since its introduction in 2015, but in its original form it was used in conjunction with recurrent neural networks. The innovation of the 2017 transformer paper was, in part, to eliminate RNNs entirely. This is why the 2017 work is called “Attention Is All You Need”.
The last piece of the transformer is a variant of attention called “self-attention.”
If attention helps to align words when translating from one language to another, then self-attention allows the model to understand the meaning and patterns of the language.
For example, consider these two sentences:
“Nikolai lost his car key”
“The crane key headed south” (in the Russian original, the word for “key”, ключ, also means a wedge-shaped flock of migrating birds)
The word “key” here means two very different things, and we humans, knowing the context, can easily tell the meanings apart. Self-attention allows the neural network to understand a word in the context of the words around it.
So when the model processes the word “key” in the first sentence, it may attend to “car” and understand that the sentence is about a shaped metal rod for a lock, not something else.
In the second sentence, the model can attend to the words “crane” and “south” and link “key” to a flock of birds. Self-attention helps neural networks disambiguate words, perform part-of-speech tagging, learn semantic roles, and more.
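A minimal sketch of self-attention, single-head and without learned parameters: each word’s new representation becomes a weighted mix of all words in the sentence, with weights given by a softmax over pairwise similarity. Real transformer layers add learned query/key/value projections on top of this, but the core mixing step looks like the following:

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention with Q = K = V = X.
    A sketch of the mechanism, not a full transformer layer."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # similarity of every word pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ X  # each word becomes a context-weighted mix

# Toy 4-word sentence with 8-dimensional embeddings.
X = np.random.default_rng(1).normal(size=(4, 8))
out = self_attention(X)  # same shape as the input
```

This is how the vector for “key” can absorb information from “crane” and “south”: its output row is literally a blend of the other words’ vectors, weighted by relevance.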
Where are they used?
Transformers were originally positioned as a neural network for processing and understanding natural language. In the four years since their inception, they have gained popularity and appeared in a variety of services used by millions of people every day.
One of the simplest examples is the BERT language model from Google, developed in 2018.
On October 25, 2019, the tech giant announced that the algorithm was being used in the English-language version of its search engine in the United States. A month and a half later, the company expanded the list of supported languages to 70, including Russian, Ukrainian, Kazakh and Belarusian.
The original English-language model was trained on the 800-million-word BooksCorpus dataset and Wikipedia articles. The base version of BERT contained 110 million parameters, and the large version 340 million.
Another example of a popular transformer-based language model is OpenAI’s Generative Pre-trained Transformer (GPT).
The most recent version of the model to date is GPT-3. It was trained on a 570 GB dataset and has 175 billion parameters, making it one of the largest language models in existence.
GPT-3 can generate articles, answer questions, serve as the basis for chatbots, perform semantic search, and summarize texts.
GPT-3 also underpins GitHub Copilot, an AI assistant for automatically writing code. It is based on Codex, a special version of GPT-3 trained on a dataset of source code. Researchers have estimated that since its August 2021 release, 30% of new code on GitHub has been written with Copilot.
In addition, transformers are increasingly used in Yandex services such as Search, News and Translator, in Google products, in chatbots, and so on. Sberbank has also released its own modification of GPT, trained on 600 GB of Russian-language texts.
What are the prospects for transformers?
The potential of transformers has not yet been fully revealed. They have already proven themselves in text processing, but recently this type of neural network has also been applied to other tasks, such as computer vision.
At the end of 2020, transformer-based computer vision models performed well on popular benchmarks such as object detection on the COCO dataset and image classification on ImageNet.
In October 2020, researchers at Facebook AI Research published a paper describing Data-efficient Image Transformers (DeiT), a model based on transformers. According to the authors, they found a way to train the algorithm without a huge set of labeled data and achieved a high image-recognition accuracy of 85%.
In May 2021, experts from Facebook AI Research presented DINO, an open-source computer vision algorithm that automatically segments objects in photos and videos without manual labeling. It is also based on transformers, and its segmentation accuracy reaches 80%.
Thus, in addition to NLP, transformers are increasingly finding application in other tasks.
What threats do transformers carry?
Beyond the obvious benefits, NLP transformers pose a number of threats. The creators of GPT-3 have stated more than once that the neural network could be used for massive spam attacks, harassment or disinformation.
In addition, the language model is prone to bias against certain groups of people. Although the developers have reduced GPT-3’s toxicity, they are still not ready to make the tool available to a wide range of developers.
In September 2020, researchers at Middlebury College published a report on the risk that the spread of large language models could radicalize society. They noted that GPT-3 demonstrates “significant improvements” over its predecessor, GPT-2, in generating extremist texts.
One of the “fathers of deep learning”, Yann LeCun, has also criticized the technology. He said that many expectations about the capabilities of large language models are unrealistic.
“Trying to build intelligent machines by scaling up language models is like building airplanes to fly to the moon. You might set altitude records, but going to the moon will require a completely different approach,” LeCun wrote.